Computation and Language 85
★ Charting and Navigating Hugging Face's Model Atlas
As there are now millions of publicly available neural networks, searching
and analyzing large model repositories becomes increasingly important.
Navigating so many models requires an atlas, but as most models are poorly
documented, charting such an atlas is challenging. To explore the hidden
potential of model repositories, we chart a preliminary atlas representing the
documented fraction of Hugging Face. It provides stunning visualizations of the
model landscape and evolution. We demonstrate several applications of this
atlas including predicting model attributes (e.g., accuracy), and analyzing
trends in computer vision models. However, as the current atlas remains
incomplete, we propose a method for charting undocumented regions.
Specifically, we identify high-confidence structural priors based on dominant
real-world model training practices. Leveraging these priors, our approach
enables accurate mapping of previously undocumented areas of the atlas. We
publicly release our datasets, code, and interactive atlas.
★ SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their
application in scientific problem-solving, yet their fine-grained capabilities
remain under-explored. In this paper, we introduce SciVerse, a multi-modal
scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test
instances in five distinct versions. We aim to investigate three key dimensions
of LMMs: scientific knowledge comprehension, multi-modal content
interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs
possess sufficient scientific expertise, we first transform each problem into
three versions containing different levels of knowledge required for solving,
i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret
multi-modal scientific content, we annotate another two versions, i.e.,
Vision-rich and -only, transferring progressively more question information
from the text into diagrams. By comparing the results of different versions,
SciVerse systematically
examines the professional knowledge stock and visual perception skills of LMMs
in scientific domains. In addition, to rigorously assess CoT reasoning, we
propose a new scientific CoT evaluation strategy, conducting a step-wise
assessment on knowledge and logical errors in model outputs. Our extensive
evaluation of different LMMs on SciVerse reveals critical limitations in their
scientific proficiency and provides new insights into future developments.
Project page: https://sciverse-cuhk.github.io
comment: Initially released in September 2024. Project page:
https://sciverse-cuhk.github.io
★ Transformers without Normalization CVPR 2025
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
$\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, $S$-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
comment: CVPR 2025; Project page: https://jiachenzhu.github.io/DyT/
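The DyT operation is simple enough to state in code. A minimal NumPy sketch (in the paper, $\alpha$ is a learnable scalar and a per-channel affine transform follows, as after layer normalization; here all parameters are fixed scalars for illustration):

```python
import numpy as np

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: an element-wise drop-in for a normalization layer.

    alpha is a learnable scalar in the paper; gamma and beta stand in for
    the affine parameters that usually follow layer normalization.
    """
    return gamma * np.tanh(alpha * np.asarray(x)) + beta

# The output is bounded and S-shaped, mimicking the tanh-like
# input-output mappings that trained layer norms are observed to produce.
x = np.linspace(-10, 10, 5)
y = dyt(x, alpha=0.5)
```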
★ Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search ICLR 2025
We introduce Siege, a multi-turn adversarial framework that models the
gradual erosion of Large Language Model (LLM) safety through a tree search
perspective. Unlike single-turn jailbreaks that rely on one meticulously
engineered prompt, Siege expands the conversation at each turn in a
breadth-first fashion, branching out multiple adversarial prompts that exploit
partial compliance from previous responses. By tracking these incremental
policy leaks and re-injecting them into subsequent queries, Siege reveals how
minor concessions can accumulate into fully disallowed outputs. Evaluations on
the JailbreakBench dataset show that Siege achieves a 100% success rate on
GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries
than baselines such as Crescendo or GOAT. This tree search methodology offers
an in-depth view of how model safeguards degrade over successive dialogue
turns, underscoring the urgency of robust multi-turn testing procedures for
language models.
comment: Accepted to ICLR 2025 Trustworthy LLM
★ From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
Large language models (LLMs) have shown remarkable performance and
generalization capabilities across multiple languages and tasks, making them
very attractive targets for multi-modality integration (e.g., images or
speech). In this work, we extend an existing LLM to the speech modality via
speech discretization and continued pre-training. In particular, we are
interested in multilingual LLMs, such as TOWER, as their pre-training setting
allows us to treat discretized speech input as an additional translation
language. The resulting open-source model, SPIRE, is able to transcribe and
translate English speech input while maintaining TOWER's original performance
on translation-related tasks, showcasing that discretized speech input
integration as an additional language is feasible during LLM adaptation. We
make our code and models available to the community.
★ Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models ICLR 2025
Adapting large language models to multiple tasks can cause cross-skill
interference, where improvements for one skill degrade another. While methods
such as LoRA impose orthogonality constraints at the weight level, they do not
fully address interference in hidden-state representations. We propose
Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel
representation-based approach that learns multiple orthonormal subspace
transformations, each specializing in a distinct skill, and composes them via a
lightweight router. By isolating these subspace edits in the hidden state,
rather than weight matrices, CS-ReFT prevents cross-task conflicts more
effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B
achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring
only 0.0098% of model parameters. These findings show that specialized
representation edits, composed via a simple router, significantly enhance
multi-task instruction following with minimal overhead.
comment: Accepted to ICLR 2025 SCOPE
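The composition CS-ReFT performs can be sketched with NumPy. The edit below follows the ReFT-style low-rank update $h + R^\top(Wh + b - Rh)$ with an orthonormal $R$ per skill, gated by router weights; the shapes and the router values here are simplified assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def orthonormal_rows(d, r):
    # r x d matrix with orthonormal rows, via QR decomposition.
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q.T

def cs_reft_edit(h, skills, gates):
    """Compose per-skill subspace edits on a hidden state h.

    Each skill is (R, W, b): R has orthonormal rows spanning that skill's
    subspace; the ReFT-style edit replaces the projection of h onto the
    subspace with a learned linear map of h. `gates` are the router's
    per-skill weights (a learned lightweight router in the paper).
    """
    out = h.copy()
    for (R, W, b), g in zip(skills, gates):
        out = out + g * (R.T @ (W @ h + b - R @ h))
    return out

d, r = 8, 2
skills = [(orthonormal_rows(d, r),
           rng.standard_normal((r, d)) * 0.1,
           np.zeros(r)) for _ in range(3)]
h = rng.standard_normal(d)
```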
★ TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Object Hallucination (OH) has been acknowledged as one of the major
trustworthy challenges in Large Vision-Language Models (LVLMs). Recent
advancements in Large Language Models (LLMs) indicate that internal states,
such as hidden states, encode the "overall truthfulness" of generated
responses. However, it remains under-explored how internal states in LVLMs
function and whether they could serve as "per-token" hallucination indicators,
which is essential for mitigating OH. In this paper, we first conduct an
in-depth exploration of LVLM internal states in relation to OH issues and
discover that (1) LVLM internal states are high-specificity per-token
indicators of hallucination behaviors. Moreover, (2) different LVLMs encode
universal patterns of hallucinations in common latent subspaces, indicating
that there exist "generic truthful directions" shared by various LVLMs. Based
on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt)
that first learns the truthful direction of LVLM decoding and then applies
truthful-guided inference-time intervention during LVLM decoding. We further
propose ComnHallu to enhance both cross-LVLM and cross-data hallucination
detection transferability by constructing and aligning hallucination latent
subspaces. We evaluate TruthPrInt in extensive experimental settings, including
in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks.
Experimental results indicate that TruthPrInt significantly outperforms
state-of-the-art methods. Codes will be available at
https://github.com/jinhaoduan/TruthPrInt.
comment: 15 pages, 9 figures, the first two authors contributed equally
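The "truthful direction" intervention can be illustrated with a toy steering step. This sketch shows only the inference-time shift along a given direction; learning that direction from LVLM internal states (the paper's actual contribution) is omitted, and all names here are hypothetical:

```python
import numpy as np

def steer(hidden, truthful_dir, strength=1.0):
    """Shift a hidden state along a unit-normalized 'truthful direction'.

    In the paper's setting the direction would be learned from LVLM
    internal states; here it is an arbitrary vector for illustration.
    """
    d = np.asarray(truthful_dir, dtype=float)
    d = d / np.linalg.norm(d)
    return np.asarray(hidden, dtype=float) + strength * d

h = np.zeros(4)
h2 = steer(h, [1.0, 0.0, 0.0, 0.0], strength=0.5)
```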
★ VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Vision-Language Models have made significant progress on many
perception-focused tasks; however, their progress on reasoning-focused tasks
seems limited due to the lack of high-quality and diverse training data.
In this work, we aim to address the scarcity issue of reasoning-focused
multimodal datasets. We propose VisualWebInstruct - a novel approach that
leverages search engines to create a diverse, high-quality dataset spanning
multiple disciplines such as math, physics, finance, and chemistry. Starting
with 30,000 meticulously selected seed images, we employ Google Image Search to
identify websites containing similar images. We collect and process the HTMLs
from over 700K unique URL sources. Through a pipeline of content extraction,
filtering and synthesis, we build a dataset of approximately 900K
question-answer pairs, with 40% being visual QA pairs and the rest as text QA
pairs. Models fine-tuned on VisualWebInstruct demonstrate significant
performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point
gains across benchmarks, (2) training from MAmmoTH-VL shows a 5% absolute gain.
Our best model MAmmoTH-VL2 shows state-of-the-art performance within the 10B
parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath
(55.7%). These remarkable results highlight the effectiveness of our dataset in
enhancing VLMs' reasoning capabilities for complex multimodal tasks.
comment: Technical Report
★ Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More ICLR 2025
This work concerns the path-star task, a minimal example of searching over a
graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start
node, $s$. A language model (LM) is given $G$, $s$, and a target node $t$ that
terminates one of the arms; the LM is tasked with generating the arm containing
$t$. The minimal nature of this task means only a single choice needs to be
made: which of the $D$ arms contains $t$?
Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to
a learned shortcut that absorbs training supervision. We show how this
pathology is caused by excess supervision and we present a series of solutions
demonstrating that the task is solvable via decoder-only LMs. We find that the
task's minimal nature causes its difficulty, as it prevents task decomposition.
Our solutions provide insight into the pathology and its implications for LMs
trained via next-token prediction.
comment: A reduced version of this work has been accepted to the Workshop on
Spurious Correlation and Shortcut Learning: Foundations and Solutions (SCSL)
at ICLR 2025. Full version under review
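The path-star graph itself is easy to construct; a small sketch (integer node labels and this return format are illustrative choices, not the paper's serialization):

```python
def path_star(num_arms, arm_len, start=0):
    """Build a star graph: `num_arms` disjoint paths of length `arm_len`
    radiating from the start node. Returns the edge list and, per arm, the
    ordered list of nodes (the last node of each arm is a potential
    target t)."""
    edges, arms, nxt = [], [], start + 1
    for _ in range(num_arms):
        arm, prev = [], start
        for _ in range(arm_len):
            edges.append((prev, nxt))
            arm.append(nxt)
            prev, nxt = nxt, nxt + 1
        arms.append(arm)
    return edges, arms

edges, arms = path_star(num_arms=3, arm_len=4)
```

Solving the task then amounts to a single choice: given a target node, emit the one arm (of the `num_arms` candidates) that contains it.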
★ The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory
High-quality test items are essential for educational assessments,
particularly within Item Response Theory (IRT). Traditional validation methods
rely on resource-intensive pilot testing to estimate item difficulty and
discrimination. More recently, Item-Writing Flaw (IWF) rubrics emerged as a
domain-general approach for evaluating test items based on textual features.
However, their relationship to IRT parameters remains underexplored. To address
this gap, we conducted a study involving over 7,000 multiple-choice questions
across various STEM subjects (e.g., math and biology). Using an automated
approach, we annotated each question with a 19-criteria IWF rubric and studied
relationships to data-driven IRT parameters. Our analysis revealed
statistically significant links between the number of IWFs and IRT difficulty
and discrimination parameters, particularly in life and physical science
domains. We further observed how specific IWF criteria can impact item quality
more or less severely (e.g., negative wording vs. implausible distractors).
Overall, while IWFs are useful for predicting IRT parameters--particularly for
screening low-difficulty MCQs--they cannot replace traditional data-driven
validation methods. Our findings highlight the need for further research on
domain-general evaluation rubrics and algorithms that understand
domain-specific content for robust item validation.
★ Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Discourse understanding is essential for many NLP tasks, yet most existing
work remains constrained by framework-dependent discourse representations. This
work investigates whether large language models (LLMs) capture discourse
knowledge that generalizes across languages and frameworks. We address this
question along two dimensions: (1) developing a unified discourse relation
label set to facilitate cross-lingual and cross-framework discourse analysis,
and (2) probing LLMs to assess whether they encode generalizable discourse
abstractions. Using multilingual discourse relation classification as a
testbed, we examine a comprehensive set of 23 LLMs of varying sizes and
multilingual capabilities. Our results show that LLMs, especially those with
multilingual training corpora, can generalize discourse information across
languages and frameworks. Further layer-wise analyses reveal that language
generalization at the discourse level is most salient in the intermediate
layers. Lastly, our error analysis provides an account of challenging relation
classes.
comment: 18 pages, 7 figures, 3 tables, code:
https://github.com/mainlp/discourse_probes
★ MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Yun Xing, Junjue Wang, Huitao Li, Xin Li, Kunyu Yu, Nan Liu, Qingyu Chen, Douglas Teodoro, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Traditional benchmarks struggle to evaluate increasingly sophisticated
language models in multilingual and culturally diverse contexts. To address
this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark
covering 13 typologically diverse languages with approximately 11,829 questions
per language. Building on the challenging reasoning-focused design of MMLU-Pro,
our framework employs a semi-automatic translation process: translations
generated by state-of-the-art large language models (LLMs) are rigorously
evaluated by expert annotators to ensure conceptual accuracy, terminological
consistency, and cultural relevance. We comprehensively evaluate 25
state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot
prompting strategies, analyzing their performance across linguistic and
cultural boundaries. Our experiments reveal consistent performance degradation
from high-resource languages to lower-resource ones, with the best models
achieving over 70% accuracy on English but dropping to around 40% for languages
like Swahili, highlighting persistent gaps in multilingual capabilities despite
recent advances. MMLU-ProX is an ongoing project; we are expanding our
benchmark by incorporating additional languages and evaluating more language
models to provide a more comprehensive assessment of multilingual capabilities.
★ Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents
LLMs have paved the way for truly simple document-level machine translation,
but challenges such as omission errors remain. In this paper, we study a simple
method for handling document-level machine translation, by leveraging previous
contexts in a multi-turn conversational manner. Specifically, by decomposing
documents into segments and iteratively translating them while maintaining
previous turns, this method ensures coherent translations without additional
training, and can fully re-use the KV cache of previous turns thus minimizing
computational overhead. We further propose a 'source-primed' method that first
provides the whole source document before multi-turn translation. We
empirically show this multi-turn method outperforms both translating entire
documents in a single turn and translating each segment independently according
to multiple automatic metrics in representative LLMs, establishing a strong
baseline for document-level translation using LLMs.
comment: 9 pages, 2 figures
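The multi-turn setup described above can be sketched as conversation construction. Everything below is a toy illustration: `translate_fn` stands in for a real LLM call, and the prompt wording is invented, not the paper's:

```python
def source_primed_translate(segments, translate_fn):
    """Translate a document segment by segment in one conversation.

    The first user turn provides the whole source document (the
    'source-primed' step); each later turn asks for one segment's
    translation, so earlier turns stay in context (and a real LLM
    backend could reuse their KV cache).
    """
    messages = [{"role": "user",
                 "content": "Source document:\n" + "\n".join(segments)}]
    translations = []
    for seg in segments:
        messages.append({"role": "user",
                         "content": f"Translate this segment: {seg}"})
        reply = translate_fn(messages)  # stands in for an LLM API call
        messages.append({"role": "assistant", "content": reply})
        translations.append(reply)
    return translations, messages

# A toy stand-in "translator" that just upper-cases the segment.
demo = lambda msgs: msgs[-1]["content"].split(": ", 1)[1].upper()
out, msgs = source_primed_translate(["hello world", "good night"], demo)
```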
★ LLMs in Disease Diagnosis: A Comparative Study of DeepSeek-R1 and O3 Mini Across Chronic Health Conditions
Large Language Models (LLMs) are revolutionizing medical diagnostics by
enhancing both disease classification and clinical decision-making. In this
study, we evaluate the performance of two LLM-based diagnostic tools, DeepSeek
R1 and O3 Mini, using a structured dataset of symptoms and diagnoses. We
assessed their predictive accuracy at both the disease and category levels, as
well as the reliability of their confidence scores. DeepSeek R1 achieved a
disease-level accuracy of 76% and an overall accuracy of 82%, outperforming O3
Mini, which attained 72% and 75% respectively. Notably, DeepSeek R1
demonstrated exceptional performance in Mental Health, Neurological Disorders,
and Oncology, where it reached 100% accuracy, while O3 Mini excelled in
Autoimmune Disease classification with 100% accuracy. Both models, however,
struggled with Respiratory Disease classification, recording accuracies of only
40% for DeepSeek R1 and 20% for O3 Mini. Additionally, the analysis of
confidence scores revealed that DeepSeek R1 provided high-confidence
predictions in 92% of cases, compared to 68% for O3 Mini. Ethical
considerations regarding bias, model interpretability, and data privacy are
also discussed to ensure the responsible integration of LLMs into clinical
practice. Overall, our findings offer valuable insights into the strengths and
limitations of LLM-based diagnostic systems and provide a roadmap for future
enhancements in AI-driven healthcare.
comment: 12 pages, 3 figures
★ World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Recent advances in large vision-language models (LVLMs) have shown promise
for embodied task planning, yet they struggle with fundamental challenges like
dependency constraints and efficiency. Existing approaches either solely
optimize action selection or leverage world models during inference,
overlooking the benefits of learning to model the world as a way to enhance
planning capabilities. We propose Dual Preference Optimization (D$^2$PO), a new
learning framework that jointly optimizes state prediction and action selection
through preference learning, enabling LVLMs to understand environment dynamics
for better planning. To automatically collect trajectories and stepwise
preference data without human annotation, we introduce a tree search mechanism
for extensive exploration via trial-and-error. Extensive experiments on
VoTa-Bench demonstrate that our D$^2$PO-based method significantly outperforms
existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and
LLaMA-3.2 (11B), achieving superior task success rates with more efficient
execution paths.
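D$^2$PO builds on preference learning; the standard DPO loss it extends can be made concrete in a few lines (this is the vanilla DPO objective on one preference pair, shown only to pin down the "preference learning" ingredient, not the paper's exact dual objective over state prediction and action selection):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair: prefer response w over l.

    logp_* are total log-probabilities of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model. The loss is
    -log(sigmoid(beta * margin)), where the margin compares how much more
    the policy prefers w over l than the reference does.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```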
★ Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA
While utilizing syntactic tools such as parts-of-speech (POS) tagging has
helped us understand sentence structures and their distribution across diverse
corpora, it is quite complex and poses a challenge in natural language
processing (NLP). This study focuses on understanding sentence-structure
balance - the harmonious usage of nouns, verbs, determiners, etc. - without
relying on such tools. It proposes a novel statistical method that uses
American Standard Code for Information Interchange (ASCII) codes to represent
the text of 11 corpora from various sources, examines their lexical category
alignment after compressing the representations with PCA, and analyzes the
results through histograms and normality tests such as the Shapiro-Wilk and
Anderson-Darling tests. By focusing on ASCII codes, this approach simplifies
text processing; it does not replace syntactic tools but complements them as a
resource-efficient means of assessing text balance.
The story generated by Grok shows near-normality, indicating balanced sentence
structures in LLM outputs, whereas 4 of the remaining 10 corpora pass the
normality tests. Further research could explore applications in text quality
evaluation and style analysis, with syntactic integration for broader tasks.
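The encoding step at the heart of this method is simple to illustrate (one plausible reading; the paper's exact preprocessing is not specified here):

```python
def ascii_codes(text):
    """Represent text as a sequence of ASCII code points, dropping
    characters outside the ASCII range."""
    return [ord(c) for c in text if ord(c) < 128]

codes = ascii_codes("Cat!")
# In the study, sequences like this would then be compressed with PCA
# and checked with normality tests (e.g., scipy.stats.shapiro).
```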
★ Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
This paper presents our work on the Light-R1 series, with models, data, and
code all released.
We first focus on training long COT models from scratch, specifically
starting from models initially lacking long COT capabilities. Using a
curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO,
we train our model Light-R1-32B from Qwen2.5-32B-Instruct, resulting in
superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite
being trained exclusively on math data, Light-R1-32B shows strong
generalization across other domains. In the subsequent phase of this work, we
highlight the significant benefit of the 3k dataset constructed for the second
SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled
models with this dataset, we obtain new SOTA models at 7B and 14B, while the
32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1.
Furthermore, we extend our work by applying reinforcement learning,
specifically GRPO, on long-COT models to further improve reasoning performance.
We successfully train our final Light-R1-14B-DS with RL, achieving SOTA
performance among 14B parameter models in math. With AIME24 & 25 scores of 74.0
and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and
DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected
behavior, showing a simultaneous increase in response length and reward score.
The Light-R1 series of work validates training long-COT models from scratch,
showcases the art of SFT data curation, and releases SOTA models from RL.
comment: all release at https://github.com/Qihoo360/Light-R1
★ DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
The rapid advancement of large language models (LLMs) has significantly
improved their performance in code generation tasks. However, existing code
benchmarks remain static, consisting of fixed datasets with predefined
problems. This makes them vulnerable to memorization during training, where
LLMs recall specific test cases instead of generalizing to new problems,
leading to data contamination and unreliable evaluation results. To address
these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that
overcomes the limitations of static datasets. DynaCode evaluates LLMs
systematically using a complexity-aware metric, incorporating both code
complexity and call-graph structures. DynaCode achieves large-scale diversity,
generating up to 189 million unique nested code problems across four distinct
levels of code complexity, referred to as units, and 16 types of call graphs.
Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7%
compared to MBPP+, a static code generation benchmark, with performance
progressively decreasing as complexity increases. This demonstrates DynaCode's
ability to effectively differentiate LLMs. Additionally, by leveraging call
graphs, we gain insights into LLM behavior, particularly their preference for
handling subfunction interactions within nested code.
comment: 16 pages, 11 figures
★ BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models
In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave)
beam prediction framework leveraging large language models (LLMs) to address
the challenges of high training overhead and latency in mmWave communication
systems. By combining computer vision (CV) with LLMs' cross-modal reasoning
capabilities, the framework extracts user equipment (UE) positional features
from RGB images and aligns visual-temporal features with LLMs' semantic space
through reprogramming techniques. Evaluated on a realistic
vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01%
top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks,
significantly outperforming traditional deep learning models. In few-shot
prediction scenarios, the performance degradation is limited to 12.56% (top-1)
and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction
capability.
comment: 6 pages, 7 figures, conference
★ VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
In this paper, we propose a comprehensive evaluation benchmark for Visual
Language Models (VLM) in Traditional Chinese. Our evaluation suite, the first
of its kind, contains two complementary components: (1) VisTai-MCQ, a
collection of manually curated exam multi-choice questions from 21 academic
subjects designed to test the broad knowledge and reasoning capabilities of
VLMs; and (2) VisTai-Dialogue, an open dialogue benchmark comprising 131
image-question pairs manually created to evaluate VLMs' ability in free-form
dialogue generation within Taiwanese cultural contexts. These benchmarks
address a critical gap in the evaluation landscape, where existing benchmarks
predominantly focus on English or Simplified Chinese, neglecting the unique
linguistic and cultural aspects of Traditional Chinese used in regions like
Taiwan and Hong Kong. Our analysis reveals significant performance differences
across various VLMs and highlights specific challenges in processing
Traditional Chinese visual content.
★ Understanding the Logical Capabilities of Large Language Models via Out-of-Context Representation Learning
We study the capabilities of Large Language Models (LLMs) on binary relations,
a ubiquitous concept in mathematics employed in most reasoning, math, and
logic benchmarks. This work focuses on equality, inequality, and inclusion,
along
with the properties they satisfy, such as ir/reflexivity, a/symmetry,
transitivity, and logical complexity (e.g., number of reasoning "hops"). We
propose an alternative to in-context learning that trains only the
representations of newly introduced tokens, namely out-of-context
representation learning. This method mitigates linguistic biases already
present in a model and, differently from in-context learning, does not rely on
external information or illustrations. We argue that out-of-context
representation learning is a better alternative to in-context learning and
fine-tuning for evaluating the capabilities of LLMs on logic tasks that are
the building blocks of more complex reasoning benchmarks.
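Out-of-context representation learning can be illustrated with a toy masked update: only the embedding rows of newly introduced tokens receive gradient steps, while everything else stays frozen. This is a schematic NumPy sketch, not the paper's implementation (real setups would express this via optimizer parameter groups):

```python
import numpy as np

def update_new_token_embeddings(E, grad, new_token_ids, lr=0.1):
    """Apply a gradient step only to the rows of the embedding matrix E
    that belong to newly introduced tokens; all other parameters (and all
    other embedding rows) stay frozen."""
    E = E.copy()
    ids = list(new_token_ids)
    E[ids] -= lr * grad[ids]
    return E

E = np.ones((5, 3))      # toy vocabulary of 5 tokens, embedding dim 3
grad = np.ones((5, 3))   # pretend gradient from some logic task
E2 = update_new_token_embeddings(E, grad, new_token_ids=[3, 4], lr=0.5)
```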
★ G-Boost: Boosting Private SLMs with General LLMs
Due to limited computational resources, most Large Language Model (LLM)
developers can only fine-tune Small Language Models (SLMs) on their own data.
These private SLMs typically have limited effectiveness. To boost the
performance of private SLMs, this paper proposes to ask general LLMs for help.
The general LLMs can be APIs or larger LLMs whose inference cost the developers
can afford. Specifically, we propose the G-Boost framework where a private SLM
adaptively performs collaborative inference with a general LLM under the
guidance of a process reward. Experiments demonstrate that our framework can
significantly
boost the performance of private SLMs.
★ Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark
Viktor Moskvoretskii, Alina Lobanova, Ekaterina Neminova, Chris Biemann, Alexander Panchenko, Irina Nikishina
This paper explores the feasibility of using text-to-image models in a
zero-shot setup to generate images for taxonomy concepts. While text-based
methods for taxonomy enrichment are well-established, the potential of the
visual dimension remains unexplored. To address this, we propose a
comprehensive benchmark for Taxonomy Image Generation that assesses models'
abilities to understand taxonomy concepts and generate relevant, high-quality
images. The benchmark includes common-sense and randomly sampled WordNet
concepts, alongside LLM-generated predictions. Twelve models are evaluated
using nine novel taxonomy-related text-to-image metrics and human feedback.
Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for
image generation. Experimental results show that the ranking of models differs
significantly from standard T2I tasks. Playground-v2 and FLUX consistently
outperform across metrics and subsets, while the retrieval-based approach
performs poorly. These findings highlight the potential for automating the
curation of
structured data resources.
comment: Labeled data and generated image Wordnet are published at
https://huggingface.co/collections/VityaVitalich/generated-image-wordnet-67d2c868ff1414ec2f8e0d3d
★ A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization
Automatic patent summarization approaches that help in the patent analysis
and comprehension procedure are in high demand due to the colossal growth of
innovations. The development of natural language processing (NLP), text mining,
and deep learning has notably amplified the efficacy of text summarization
models for abundant types of documents. Summarizing patent text remains a
pertinent challenge due to the labyrinthine writing style of these documents,
which includes technical and legal intricacies. Additionally, patent documents
are considerably lengthier than typical documents, which complicates the
process of extracting pertinent information for summarization.
Embodying extractive and abstractive text summarization methodologies into a
hybrid framework, this study proposes a system for efficiently creating
abstractive summaries of patent records. The procedure involves leveraging the
LexRank graph-based algorithm to retrieve important sentences from input
patent texts, then utilizing a Bidirectional Auto-Regressive Transformer (BART)
model that has been fine-tuned using Low-Ranking Adaptation (LoRA) for
producing text summaries. This is accompanied by methodical testing and
evaluation strategies. Furthermore, the author employed certain meta-learning
techniques to achieve Domain Generalization (DG) of the abstractive component
across multiple patent fields.
comment: Accepted Paper in the 8th International Research Conference on Smart
Computing and Systems Engineering, University of Kelaniya, Sri Lanka.
(Pending Publication)
★ New Trends for Modern Machine Translation with Large Reasoning Models
Recent advances in Large Reasoning Models (LRMs), particularly those
leveraging Chain-of-Thought (CoT) reasoning, have opened new possibilities
for Machine Translation (MT). This position paper argues that LRMs have
substantially transformed both traditional neural MT and LLM-based MT
paradigms by reframing translation as a dynamic reasoning task that requires
contextual, cultural, and linguistic understanding and reasoning. We identify
three foundational shifts: 1) contextual coherence, where LRMs resolve
ambiguities and preserve discourse structure through explicit reasoning over
cross-sentence and complex context or even lack of context; 2) cultural
intentionality, enabling models to adapt outputs by inferring speaker intent,
audience expectations, and socio-linguistic norms; 3) self-reflection, where
LRMs can self-correct potential translation errors at inference time,
especially in extremely noisy cases, showing greater robustness than direct
X->Y mapping. We explore various
scenarios in translation including stylized translation, document-level
translation and multimodal translation by showcasing empirical examples that
demonstrate the superiority of LRMs in translation. We also identify several
interesting phenomena of LRMs in MT, including auto-pivot translation, as
well as critical challenges such as over-localisation and inference
efficiency. In conclusion, we argue that LRMs redefine translation systems
not merely as text converters but as multilingual cognitive agents capable
of reasoning about meaning beyond the text. This paradigm shift invites us
to consider translation problems beyond traditional scenarios, in the much
broader context of what can be built on top of LRMs.
★ KV-Distill: Nearly Lossless Learnable Context Compression for LLMs
Sequence-to-sequence tasks often benefit from long contexts, but the
quadratic complexity of self-attention in standard Transformers renders this
non-trivial. During generation, temporary representations, stored in the
so-called KV cache, account for a large portion of GPU memory usage and scale
linearly with context length. We introduce KV-Distill, a Transformer
compression framework that distills long context KV caches into significantly
shorter representations in a question-independent fashion. KV-Distill can be
trained as a parameter-efficient adaptor for pretrained models, and enables the
compression of arbitrary spans of a context while preserving pre-trained model
capabilities. We treat a compressed-uncompressed cache as a student-teacher
pairing and apply a KL-type divergence to match the generated outputs.
KV-Distill outperforms other compression techniques in worst-case extractive
tasks and approaches uncompressed performance in long context question
answering and summarization, and it can be fine-tuned on domain-specific
contexts to reduce lengths by up to 99% while preserving downstream
performance. We demonstrate the generalizability of KV-Distill across various
model sizes and architectures.
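The student-teacher objective described above can be illustrated with a toy example: the model conditioned on the full KV cache acts as teacher, the model conditioned on the compressed cache as student, and a KL-type divergence aligns their next-token distributions. The distributions below are hard-coded stand-ins; a real implementation would take softmaxed logits from two forward passes of the same model.

```python
import math

def kl_divergence(teacher: list[float], student: list[float]) -> float:
    """KL(teacher || student) over a shared vocabulary."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

teacher_probs = [0.70, 0.20, 0.10]   # next-token dist. with full KV cache
student_good  = [0.65, 0.25, 0.10]   # compressed cache, close to teacher
student_bad   = [0.10, 0.20, 0.70]   # compressed cache, far from teacher

loss_good = kl_divergence(teacher_probs, student_good)
loss_bad = kl_divergence(teacher_probs, student_bad)
print(loss_good, loss_bad)
```

Minimizing this divergence during training pushes the compressed cache to reproduce the generation behavior of the uncompressed one, which is why the compression can be "nearly lossless".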
★ OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Open Semantic Mapping (OSM) is a key technology in robotic perception,
combining semantic segmentation and SLAM techniques. This paper introduces a
dynamically configurable and highly automated LLM/LVLM-powered pipeline for
evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark).
The study focuses on evaluating state-of-the-art semantic mapping algorithms
under varying indoor lighting conditions, a critical challenge in indoor
environments. We introduce a novel dataset with simulated RGB-D sequences and
ground truth 3D reconstructions, facilitating the rigorous analysis of mapping
performance across different lighting conditions. Through experiments on
leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the
semantic fidelity of object recognition and segmentation. Additionally, we
introduce a Scene Graph evaluation method to analyze the ability of models to
interpret semantic structure. The results provide insights into the robustness
of these models, forming future research directions for developing resilient
and adaptable robotic systems. Our code is available at
https://be2rlab.github.io/OSMa-Bench/.
comment: Project page: https://be2rlab.github.io/OSMa-Bench/
★ Wikipedia is Not a Dictionary, Delete! Text Classification as a Proxy for Analysing Wiki Deletion Discussions
Automated content moderation for collaborative knowledge hubs like Wikipedia
or Wikidata is an important yet challenging task due to multiple factors. In
this paper, we construct a database of discussions happening around articles
marked for deletion in several Wikis and in three languages, which we then use
to evaluate a range of LMs on different tasks (from predicting the outcome of
the discussion to identifying the implicit policy an individual comment might
be pointing to). Our results reveal, among other findings, that discussions
leading to deletion are easier to predict, and that, surprisingly,
self-produced tags (keep, delete or redirect) don't always help guide the
classifiers, presumably because of users' hesitation or deliberation within
comments.
comment: Accepted to WNUT-2025
★ VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM)
with 8B parameters, which improves the reasoning abilities of existing
Multimodal Large Language Models (MLLMs) across different model scales and
families with Best-of-N (BoN) evaluation strategies. Specifically, our model
improves the reasoning performance of three types of MLLMs and four different
model scales. Even when applied to the highly capable InternVL2.5-78B, it
achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.
Experimental results show that our model exhibits superior performance compared
to Outcome Reward Models and Self-Consistency during BoN evaluation. To
facilitate the training of multimodal PRMs, we construct a multimodal process
supervision dataset VisualPRM400K using an automated data pipeline. For the
evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with
human-annotated step-wise correctness labels, to measure the abilities of PRMs
to detect erroneous steps in multimodal reasoning tasks. We hope that our work
can inspire more future research and contribute to the development of MLLMs.
Our model, data, and benchmark are released in
https://internvl.github.io/blog/2025-03-13-VisualPRM/.
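The Best-of-N (BoN) strategy mentioned above can be sketched simply: sample N candidate responses, score each reasoning step with a process reward model (PRM), aggregate the step scores, and keep the best candidate. The candidates and per-step scores below are fabricated stand-ins for actual MLLM outputs and VisualPRM scores; mean aggregation is one common choice among several.

```python
# Toy sketch of Best-of-N selection with a process reward model (PRM):
# rank candidates by their aggregated per-step scores, return the best.

def bon_select(candidates: dict[str, list[float]]) -> str:
    """Pick the candidate whose mean per-step PRM score is highest."""
    def mean(xs: list[float]) -> float:
        return sum(xs) / len(xs)
    return max(candidates, key=lambda c: mean(candidates[c]))

# Each candidate maps to hypothetical PRM scores for its reasoning steps.
candidates = {
    "answer A": [0.9, 0.8, 0.2],   # goes wrong at the last step
    "answer B": [0.9, 0.9, 0.8],   # consistently sound reasoning
    "answer C": [0.3, 0.4, 0.5],   # weak throughout
}
print(bon_select(candidates))
```

Unlike an outcome reward model, the step-level scores let the selector penalize a candidate whose reasoning collapses late, even if its final answer looks plausible.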
★ An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, and Jaume Zaragoza-Bernabeu
Training state-of-the-art large language models requires vast amounts of
clean and diverse textual data. However, building suitable multilingual
datasets remains a challenge. In this work, we present HPLT v2, a collection of
high-quality multilingual monolingual and parallel corpora. The monolingual
portion of the data contains 8T tokens covering 193 languages, while the
parallel data contains 380M sentence pairs covering 51 languages. We document
the entire data pipeline and release the code to reproduce it. We provide
extensive analysis of the quality and characteristics of our data. Finally, we
evaluate the performance of language models and machine translation systems
trained on HPLT v2, demonstrating its value.
★ MinorBench: A hand-built benchmark for content-based risks for children
Large Language Models (LLMs) are rapidly entering children's lives - through
parent-driven adoption, schools, and peer networks - yet current AI ethics and
safety research does not adequately address content-related risks specific to
minors. In this paper, we highlight these gaps with a real-world case study of
an LLM-based chatbot deployed in a middle school setting, revealing how
students used and sometimes misused the system. Building on these findings, we
propose a new taxonomy of content-based risks for minors and introduce
MinorBench, an open-source benchmark designed to evaluate LLMs on their ability
to refuse unsafe or inappropriate queries from children. We evaluate six
prominent LLMs under different system prompts, demonstrating substantial
variability in their child-safety compliance. Our results inform practical
steps for more robust, child-focused safety mechanisms and underscore the
urgency of tailoring AI systems to safeguard young users.
★ ARLED: Leveraging LED-based ARMAN Model for Abstractive Summarization of Persian Long Documents
The increasing volume of textual data poses challenges in reading and
comprehending large documents, particularly for scholars who need to extract
useful information from research articles. Automatic text summarization has
emerged as a powerful tool to condense lengthy documents into concise and
informative summaries. Depending on the approach used, text summarization can
be categorized as either extractive or abstractive. While extractive methods
are commonly used due to their simplicity, they often miss important
information. On the other hand, abstractive summarization can generate more
coherent and informative summaries by understanding the underlying meaning of
the text. Abstractive techniques have gained attention in various languages,
and recent advancements have been achieved through pre-training models such as
BERT, BART, and T5. However, the challenge of summarizing long documents
remains, and alternative models like Longformer have been introduced to address
this limitation. In this context, this paper focuses on abstractive
summarization in the Persian language. The authors introduce a new dataset of
300,000 full-text Persian papers obtained from the Ensani website and apply the
ARMAN model, based on the Longformer architecture, to generate summaries. The
experimental results demonstrate promising performance in Persian text
summarization. The paper provides a comprehensive overview of related work,
discusses the methodology, presents the experimental results, and concludes
with future research directions.
comment: 11 pages, 3 tables
★ R.U.Psycho? Robust Unified Psychometric Testing of Language Models
Generative language models are increasingly being subjected to psychometric
questionnaires intended for human testing, in efforts to establish their
traits, as benchmarks for alignment, or to simulate participants in social
science experiments. While this growing body of work sheds light on the
likeness of model responses to those of humans, concerns are warranted
regarding the rigour and reproducibility with which these experiments may be
conducted. Instabilities in model outputs, sensitivity to prompt design,
parameter settings, and a large number of available model versions increase
documentation requirements. Consequently, generalization of findings is often
complex and reproducibility is far from guaranteed. In this paper, we present
R.U.Psycho, a framework for designing and running robust and reproducible
psychometric experiments on generative language models that requires limited
coding expertise. We demonstrate the capability of our framework on a variety
of psychometric questionnaires, which lend support to prior findings in the
literature. R.U.Psycho is available as a Python package at
https://github.com/julianschelb/rupsycho.
★ Assessing the validity of new paradigmatic complexity measures as criterial features for proficiency in L2 writings in English
Cyriel Mallart, Andrew Simpkin, Nicolas Ballier, Paula Lissón, Rémi Venant, Jen-Yu Li, Bernardo Stearns, Thomas Gaillat
This article addresses Second Language (L2) writing development through an
investigation of new grammatical and structural complexity metrics. We explore
the paradigmatic production in learner English by linking language functions to
specific grammatical paradigms. Using the EFCAMDAT as a gold standard and a
corpus of French learners as an external test set, we employ a supervised
learning framework to operationalise and evaluate seven microsystems. We show
that learner levels are associated with the seven microsystems (MS). Using
ordinal regression modelling for evaluation, the results show that all MS are
significant but yield a low impact if taken individually. However, their
influence is shown to be impactful if taken as a group. These microsystems and
their measurement method suggest that it is possible to use them as part of
broader-purpose CALL systems focused on proficiency assessment.
★ Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
Recent advancement of large language models (LLMs) has led to significant
breakthroughs across various tasks, laying the foundation for the development
of LLM-based speech translation systems. Existing methods primarily focus on
aligning inputs and outputs across modalities while overlooking deeper semantic
alignment within model representations. To address this limitation, we propose
an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality
gap by explicitly aligning speech and text representations at selected layers
within LLMs. To achieve this, we leverage the optimal transport (OT) theory to
quantify fine-grained representation discrepancies between speech and text.
Furthermore, we utilize the cross-modal retrieval technique to identify the
layers that are best suited for alignment and perform joint training on these
layers. Experimental results on speech translation (ST) tasks demonstrate that
AI-STA significantly improves the translation performance of large speech-text
models (LSMs), outperforming previous state-of-the-art approaches. Our findings
highlight the importance of inner-layer speech-text alignment in LLMs and
provide new insights into enhancing cross-modal learning.
comment: 12 pages, 7 figures
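The use of optimal transport to quantify representation discrepancies, as described above, can be sketched with entropic OT (Sinkhorn iterations) between two small sets of embeddings. The 2-D toy embeddings, the regularization value, and squared-Euclidean cost are illustrative assumptions; a real system would compare hidden states of speech frames and text tokens at selected LLM layers.

```python
import math

def sinkhorn_cost(cost, reg=0.1, iters=200):
    """Entropic-OT transport cost between uniform marginals via Sinkhorn."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    a, b = [1.0 / n] * n, [1.0 / m] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))

def sq_dist(x, y):
    return sum((p - q) ** 2 for p, q in zip(x, y))

speech = [[0.0, 0.1], [1.0, 0.9]]        # toy speech-frame embeddings
text_close = [[0.0, 0.0], [1.0, 1.0]]    # well-aligned text embeddings
text_far = [[5.0, 5.0], [6.0, 6.0]]      # poorly aligned text embeddings

cost_close = [[sq_dist(s, t) for t in text_close] for s in speech]
cost_far = [[sq_dist(s, t) for t in text_far] for s in speech]
d_close = sinkhorn_cost(cost_close)
d_far = sinkhorn_cost(cost_far)
print(d_close, d_far)
```

A small transport cost indicates the two modalities' representations already occupy similar regions; layers with large cost are candidates for explicit alignment training.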
★ Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives
Miguel Romero-Arjona, Pablo Valle, Juan C. Alonso, Ana B. Sánchez, Miriam Ugarte, Antonia Cazalilla, Vicente Cambrón, José A. Parejo, Aitor Arrieta, Sergio Segura
The battle for AI leadership is on, with OpenAI in the United States and
DeepSeek in China as key contenders. In response to these global trends, the
Spanish government has proposed ALIA, a public and transparent AI
infrastructure incorporating small language models designed to support Spanish
and co-official languages such as Basque. This paper presents the results of
Red Teaming sessions, where ten participants applied their expertise and
creativity to manually test three of the latest models from these
initiatives - OpenAI o3-mini, DeepSeek R1, and ALIA Salamandra - focusing on
biases and safety concerns. The results,
based on 670 conversations, revealed vulnerabilities in all the models under
test, with biased or unsafe responses ranging from 29.5% in o3-mini to 50.6% in
Salamandra. These findings underscore the persistent challenges in developing
reliable and trustworthy AI systems, particularly those intended to support
Spanish and Basque languages.
★ PRISM: Preference Refinement via Implicit Scene Modeling for 3D Vision-Language Preference-Based Reinforcement Learning
We propose PRISM, a novel framework designed to overcome the limitations of
2D-based Preference-Based Reinforcement Learning (PBRL) by unifying 3D point
cloud modeling and future-aware preference refinement. At its core, PRISM
adopts a 3D Point Cloud-Language Model (3D-PC-LLM) to mitigate occlusion and
viewpoint biases, ensuring more stable and spatially consistent preference
signals. Additionally, PRISM leverages Chain-of-Thought (CoT) reasoning to
incorporate long-horizon considerations, thereby preventing the short-sighted
feedback often seen in static preference comparisons. In contrast to
conventional PBRL techniques, this integration of 3D perception and
future-oriented reasoning leads to significant gains in preference agreement
rates, faster policy convergence, and robust generalization across unseen
robotic environments. Our empirical results, spanning tasks such as robotic
manipulation and autonomous navigation, highlight PRISM's potential for
real-world applications where precise spatial understanding and reliable
long-term decision-making are critical. By bridging 3D geometric awareness with
CoT-driven preference modeling, PRISM establishes a comprehensive foundation
for scalable, human-aligned reinforcement learning.
★ "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding
Large language models (LLMs) exhibit strong reasoning abilities, often
attributed to few-shot or zero-shot chain-of-thought (CoT) prompting. While
effective, these methods require labor-intensive prompt engineering, raising
the question of whether reasoning can be induced without reliance on explicit
prompts. In this work, we unlock the reasoning capabilities of LLMs without
explicit prompting. Inspired by zero-shot CoT and CoT-decoding, we propose a
novel decoding strategy that systematically nudges LLMs to continue reasoning,
thereby preventing immature reasoning processes. Specifically, we monitor the
model's generation and inject a designated phrase whenever it is likely to
conclude its response prematurely, before completing the reasoning process. Our
experimental evaluations on diverse reasoning benchmarks demonstrate that our
proposed strategy substantially improves LLM reasoning capabilities,
highlighting the potential of decoding-based interventions as an alternative to
traditional prompting techniques.
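The decoding-time intervention described above can be mocked in a few lines: watch the model's next-token choice and, whenever it wants to end the response before a minimum amount of reasoning has been produced, inject a continuation phrase instead of accepting the early stop. The mock "model", the token budget, and the injected phrase are illustrative stand-ins for a real LLM decoding loop.

```python
EOS = "<eos>"
INJECTION = ["Well,", "keep", "thinking:"]

def mock_next_token(tokens: list[str]) -> str:
    """Stand-in for an LLM step: tries to stop once it has 3 tokens."""
    return EOS if len(tokens) >= 3 else f"step{len(tokens)}"

def decode_with_injection(min_tokens: int = 8, max_tokens: int = 20) -> list[str]:
    tokens: list[str] = []
    while len(tokens) < max_tokens:
        nxt = mock_next_token(tokens)
        if nxt == EOS:
            if len(tokens) >= min_tokens:   # long enough: accept the stop
                break
            tokens.extend(INJECTION)        # too early: nudge it to continue
        else:
            tokens.append(nxt)
    return tokens

out = decode_with_injection()
print(out)
```

In a real system the stopping signal would be the probability mass on the EOS token rather than the token itself, and the injected phrase would be fed back as context so the model genuinely resumes reasoning.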
★ Retrieval-Augmented Generation with Hierarchical Knowledge
Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly
enhanced the performance of large language models (LLMs) in domain-specific
tasks. However, existing RAG methods do not adequately utilize the naturally
inherent hierarchical knowledge in human cognition, which limits the
capabilities of RAG systems. In this paper, we introduce a new RAG approach,
called HiRAG, which utilizes hierarchical knowledge to enhance the semantic
understanding and structure capturing capabilities of RAG systems in the
indexing and retrieval processes. Our extensive experiments demonstrate that
HiRAG achieves significant performance improvements over the state-of-the-art
baseline methods. The code of our proposed method is available at
https://github.com/hhy-huang/HiRAG.
★ Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Speculative decoding (SPD) aims to accelerate the auto-regressive token
generation process of a target Large Language Model (LLM). Some approaches
employ a draft model with multiple heads to predict a sequence of future
tokens, where each head handles a token in the sequence. The target LLM
verifies the predicted sequence and accepts aligned tokens, enabling efficient
multi-token generation. However, existing methods assume that all tokens within
a sequence are equally important, employing identical head structures and
relying on a single-generation paradigm, either serial or parallel. In
contrast, we theoretically demonstrate that initial tokens in the draft
sequence are
more important than later ones. Building on this insight, we propose Gumiho, a
hybrid model combining serial and parallel heads. Specifically, given the
critical importance of early tokens, we employ a sophisticated Transformer
architecture for the early draft heads in a serial configuration to improve
accuracy. For later tokens, we utilize multiple lightweight MLP heads operating
in parallel to enhance efficiency. By allocating more advanced model structures
and longer running times to the early heads, Gumiho achieves improved overall
performance. The experimental results demonstrate that our method outperforms
existing approaches, fully validating its effectiveness.
comment: Paper under review
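The claim above that early draft tokens matter more can be seen directly in the speculative-decoding acceptance rule: the target model verifies the drafted sequence left to right and accepts only the matching prefix, so an error in an early position discards everything after it. The token sequences below are hypothetical; real verification compares the draft against the target model's own sampled tokens.

```python
# Why early draft tokens dominate speculative-decoding throughput:
# acceptance stops at the first mismatch with the target model.

def accepted_prefix(draft: list[str], target: list[str]) -> list[str]:
    """Longest prefix of the draft that the target model agrees with."""
    out = []
    for d, t in zip(draft, target):
        if d != t:
            break
        out.append(d)
    return out

target = ["the", "cat", "sat", "on", "the", "mat"]

early_error = ["the", "dog", "sat", "on", "the", "mat"]   # mistake at pos 1
late_error  = ["the", "cat", "sat", "on", "the", "rug"]   # mistake at pos 5

print(len(accepted_prefix(early_error, target)))  # early error: 1 token kept
print(len(accepted_prefix(late_error, target)))   # late error: 5 tokens kept
```

This asymmetry motivates spending a heavier serial Transformer head on early positions and cheap parallel MLP heads on later ones.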
★ Cognitive-Mental-LLM: Leveraging Reasoning in Large Language Models for Mental Health Prediction via Online Text
Large Language Models (LLMs) have demonstrated potential in predicting mental
health outcomes from online text, yet traditional classification methods often
lack interpretability and robustness. This study evaluates structured reasoning
techniques-Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and
Tree-of-Thought (ToT)-to improve classification accuracy across multiple mental
health datasets sourced from Reddit. We analyze reasoning-driven prompting
strategies, including Zero-shot CoT and Few-shot CoT, using key performance
metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our
findings indicate that reasoning-enhanced techniques improve classification
performance over direct prediction, particularly in complex cases. Compared to
baselines such as Zero Shot non-CoT Prompting, and fine-tuned pre-trained
transformers such as BERT and Mental-RoBerta, and fine-tuned Open Source LLMs
such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable
gains on datasets like Dreaddit (+0.52\% over M-LLM, +0.82\% over BERT) and
SDCNL (+4.67\% over M-LLM, +2.17\% over BERT). However, performance declines
on Depression Severity and CSSRS predictions suggest dataset-specific
limitations, likely due to our use of a more extensive test set. Among prompting
strategies, Few-shot CoT consistently outperforms others, reinforcing the
effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability
highlights challenges in model reliability and interpretability. This study
provides a comprehensive benchmark of reasoning-based LLM techniques for mental
health text classification. It offers insights into their potential for
scalable clinical applications while identifying key challenges for future
improvements.
comment: 8 pages, 4 Figures, 3 tables
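Of the reasoning strategies compared above, Self-Consistency (SC-CoT) is the simplest to sketch: sample several chain-of-thought completions for the same input and take a majority vote over their final labels. The sampled labels below are fabricated stand-ins for outputs of an actual LLM.

```python
from collections import Counter

def self_consistency_vote(sampled_labels: list[str]) -> str:
    """Majority vote over final answers from independent CoT samples."""
    return Counter(sampled_labels).most_common(1)[0][0]

samples = ["depressed", "not depressed", "depressed", "depressed", "not depressed"]
print(self_consistency_vote(samples))
```

Voting over multiple reasoning paths smooths out individual unstable chains, which is one reason reasoning-driven prompting can beat single-pass prediction on noisy classification tasks.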
★ Semantic Synergy: Unlocking Policy Insights and Learning Pathways Through Advanced Skill Mapping
This research introduces a comprehensive system based on state-of-the-art
natural language processing, semantic embedding, and efficient search
techniques for retrieving similarities and thus generating actionable insights
from raw textual information. The system automatically extracts and aggregates
normalized competencies from multiple documents (such as policy files and
curricula vitae) and creates strong relationships between recognized
competencies, occupation profiles, and related learning courses. To validate
its performance, we conducted a multi-tier evaluation that included both
explicit and implicit skill references in synthetic and real-world documents.
The results showed near-human-level accuracy, with F1 scores exceeding 0.95 for
explicit skill detection and above 0.93 for implicit mentions. The system
thereby establishes a sound foundation for supporting in-depth collaboration
across the AE4RIA network. The methodology involves a multi-stage pipeline
based on extensive preprocessing and data cleaning, semantic embedding and
segmentation via SentenceTransformer, and skill extraction using a FAISS-based
search method. The extracted skills are associated with occupation frameworks
(as formulated in the ESCO ontology) and with learning paths offered through
the Sustainable Development Goals Academy. Moreover, interactive visualization
software, implemented with Dash and Plotly, presents graphs and tables for
real-time exploration and informed decision-making by those involved in
policymaking, training and learning supply, career transitions, and
recruitment. Overall, this system, backed by rigorous validation, offers
promising prospects for improved policymaking, human resource development, and
lifelong learning by providing structured and actionable insights from raw,
complex textual information.
★ Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
Reinforcement Learning (RL) algorithms for safety alignment of Large Language
Models (LLMs), such as Direct Preference Optimization (DPO), encounter the
challenge of distribution shift. Current approaches typically address this
issue through online sampling from the target policy, which requires
significant computational resources. In this paper, we hypothesize that during
off-policy training, while the ranking order of output generated by policy
changes, their overall distribution remains relatively stable. This stability
allows the transformation of the sampling process from the target policy into a
re-ranking of preference data. Building on this hypothesis, we propose a new
framework that leverages the model's intrinsic safety judgment capability to
extract reward signals, which are then used to calculate label confidence for
preference reordering. Extensive experimental results and theoretical analysis
demonstrate that the proposed method effectively addresses the distribution
shift issue, markedly enhancing safety performance while reducing
computational overhead by about 300x.
★ Why Does Your CoT Prompt (Not) Work? Theoretical Analysis of Prompt Space Complexity, its Interaction with Answer Space During CoT Reasoning with LLMs: A Recurrent Perspective
Despite the remarkable successes of Large Language Models (LLMs), their
fundamental Transformer architecture possesses inherent theoretical limitations
that restrict their capability to handle reasoning tasks with increasing
computational complexity. Chain-of-Thought (CoT) prompting has emerged as a
practical solution, supported by several theoretical studies. However, current
CoT-based methods (including ToT, GoT, etc.) generally adopt a
"one-prompt-fits-all" strategy, using fixed templates (e.g., "think step by
step") across diverse reasoning tasks. This method forces models to navigate an
extremely complex prompt space to identify effective reasoning paths. Current
prompt-design research also relies heavily on trial and error rather than
theoretically informed guidance. In this paper, we provide a
rigorous theoretical analysis of the complexity and interplay between two
crucial spaces: the prompt space (the space of potential prompt structures) and
the answer space (the space of reasoning solutions generated by LLMs) in CoT
reasoning. We demonstrate how reliance on a single universal prompt (e.g.,
"think step by step") can negatively impact the theoretical computability of LLMs,
illustrating that prompt complexity directly influences the structure and
effectiveness of the navigation in answer space. Our analysis highlights that
sometimes human supervision is critical for efficiently navigating the prompt
space. We theoretically and empirically show that task-specific prompting
significantly outperforms unsupervised prompt generation, emphasizing the
necessity of thoughtful human guidance in CoT prompting.
comment: arXiv admin note: substantial text overlap with arXiv:2410.14198
★ Information Density Principle for MLLM Benchmarks
Chunyi Li, Xiaozhe Li, Zicheng Zhang, Yuan Tian, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Jia Wang, Haodong Duan, Kai Chen, Guangtao Zhai
With the emergence of Multimodal Large Language Models (MLLMs), hundreds of
benchmarks have been developed to ensure the reliability of MLLMs in downstream
tasks. However, the evaluation mechanism itself may not be reliable. For
developers of MLLMs, questions remain about which benchmark to use and whether
the test results meet their requirements. Therefore, we propose a critical
principle of Information Density, which examines how much insight a benchmark
can provide for the development of MLLMs. We characterize it from four key
dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through
a comprehensive analysis of more than 10,000 samples, we measured the
information density of 19 MLLM benchmarks. Experiments show that using the
latest benchmarks in testing can provide more insight compared to previous
ones, but there is still room for improvement in their information density. We
hope this principle can promote the development and application of future MLLM
benchmarks. Project page: https://github.com/lcysyzxdxc/bench4bench
★ Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Scaling laws are a critical component of the LLM development pipeline, most
famously as a way to forecast training decisions such as 'compute-optimally'
trading-off parameter count and dataset size, alongside a more recent growing
list of other crucial decisions. In this work, we ask whether compute-optimal
scaling behaviour can be skill-dependent. In particular, we examine knowledge
and reasoning-based skills such as knowledge-based QA and code generation, and
we answer this question in the affirmative: $\textbf{scaling laws are
skill-dependent}$. Next, to understand whether skill-dependent scaling is an
artefact of the pretraining datamix, we conduct an extensive ablation of
different datamixes and find that, also when correcting for datamix
differences, $\textbf{knowledge and code exhibit fundamental differences in
scaling behaviour}$. We conclude with an analysis of how our findings relate to
standard compute-optimal scaling using a validation set, and find that
$\textbf{a misspecified validation set can impact compute-optimal parameter
count by nearly 50%,}$ depending on its skill composition.
★ Using Context to Improve Word Segmentation
An important step in understanding how children acquire languages is studying
how infants learn word segmentation. It has been established in previous
research that infants may use statistical regularities in speech to learn word
segmentation. The research of Goldwater et al. demonstrated that incorporating
context in models improves their ability to learn word segmentation. We
implemented two of their models, a unigram and bigram model, to examine how
context can improve statistical word segmentation. The results are consistent
with our hypothesis that the bigram model outperforms the unigram model at
predicting word segmentation. Extending the work of Goldwater et al., we also
explored basic ways to model how young children might use previously learned
words to segment new utterances.
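The models discussed above can be illustrated with a minimal statistical segmenter in the unigram spirit: given word probabilities, recover the most likely segmentation of an unspaced utterance by dynamic programming. A bigram model would additionally condition each word's probability on the previous word, which is the contextual information the abstract credits with the improvement. The toy lexicon and probabilities are illustrative, not from the paper.

```python
import math

# Toy lexicon with unigram word probabilities (illustrative values).
LEXICON = {"the": 0.3, "dog": 0.2, "thedo": 0.01, "g": 0.01, "see": 0.2, "saw": 0.2}

def segment(utterance: str) -> list[str]:
    """Most likely segmentation under a unigram model (Viterbi DP)."""
    n = len(utterance)
    best = [(-math.inf, -1)] * (n + 1)      # (log-prob, backpointer)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            word = utterance[start:end]
            if word in LEXICON and best[start][0] > -math.inf:
                score = best[start][0] + math.log(LEXICON[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return []                            # no full segmentation found
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(utterance[start:end])
        end = start
    return list(reversed(words))

print(segment("thedog"))
```

Even though "thedo" + "g" is a valid segmentation, its joint probability is far lower, so the DP recovers the intended words, mirroring how infants might exploit statistical regularities.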
★ ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content
Large Multimodal Models (LMMs) are increasingly vulnerable to AI-generated
extremist content, including photorealistic images and text, which can be used
to bypass safety mechanisms and generate harmful outputs. However, existing
datasets for evaluating LMM robustness offer limited exploration of extremist
content, often lacking AI-generated images, diverse image generation models,
and comprehensive coverage of historical events, which hinders a complete
assessment of model vulnerabilities. To fill this gap, we introduce
ExtremeAIGC, a benchmark dataset and evaluation framework designed to assess
LMM vulnerabilities against such content. ExtremeAIGC simulates real-world
events and malicious use cases by curating diverse text- and image-based
examples crafted using state-of-the-art image generation techniques. Our study
reveals alarming weaknesses in LMMs, demonstrating that even cutting-edge
safety measures fail to prevent the generation of extremist material. We
systematically quantify the success rates of various attack strategies,
exposing critical gaps in current defenses and emphasizing the need for more
robust mitigation strategies.
comment: Preprint
★ Take Off the Training Wheels: Progressive In-Context Learning for Effective Alignment EMNLP2024
Recent studies have explored the working mechanisms of In-Context Learning
(ICL). However, they mainly focus on classification and simple generation
tasks, limiting their broader application to more complex generation tasks in
practice. To address this gap, we investigate the impact of demonstrations on
token representations within the practical alignment tasks. We find that the
transformer embeds the task function learned from demonstrations into the
separator token representation, which plays an important role in the generation
of prior response tokens. Once the prior response tokens are determined, the
demonstrations become redundant. Motivated by this finding, we propose an
efficient Progressive In-Context Alignment (PICA) method consisting of two
stages. In the first few-shot stage, the model generates several prior response
tokens via standard ICL while concurrently extracting the ICL vector that
stores the task function from the separator token representation. In the
following zero-shot stage, this ICL vector guides the model to generate
responses without further demonstrations. Extensive experiments demonstrate that
our PICA not only surpasses vanilla ICL but also achieves comparable
performance to other alignment tuning methods. The proposed training-free
method reduces the time cost (e.g., 5.45+) with improved alignment performance
(e.g., 6.57+). Consequently, our work highlights the application of ICL for
alignment and calls for a deeper understanding of ICL for complex generations.
The code will be available at https://github.com/HITsz-TMG/PICA.
comment: 15 pages, 9 figures, published in EMNLP2024
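The two-stage procedure the PICA abstract describes can be sketched in a few lines. Everything here is illustrative, not the paper's implementation: the real method reads the separator token's hidden state inside a transformer, whereas `hidden_state`, the mean pooling, and the steering strength `alpha` are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer's representation at the separator token;
# the actual method reads a specific position's hidden state.
def hidden_state(token_embeddings):
    return token_embeddings.mean(axis=0)  # stub "representation"

demos = rng.normal(size=(8, 16))   # few-shot demonstration tokens
query = rng.normal(size=(4, 16))   # query tokens only

# Stage 1 (few-shot): extract the ICL vector that stores the task function,
# here approximated by pooling over demonstrations + query.
icl_vector = hidden_state(np.vstack([demos, query]))

# Stage 2 (zero-shot): steer generation by adding the cached ICL vector
# to the query-only representation; alpha is a hypothetical strength knob.
alpha = 1.0
steered = hidden_state(query) + alpha * icl_vector
```

Once the ICL vector is cached, the demonstrations can be dropped entirely, which is where the claimed time savings come from.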
★ Developing and Evaluating an AI-Assisted Prediction Model for Unplanned Intensive Care Admissions following Elective Neurosurgery using Natural Language Processing within an Electronic Healthcare Record System
Julia Ive, Olatomiwa Olukoya, Jonathan P. Funnell, James Booker, Sze H M Lam, Ugan Reddy, Kawsar Noor, Richard JB Dobson, Astri M. V. Luoma, Hani J Marcus
Introduction: Timely care in a specialised neuro-intensive therapy unit (ITU)
reduces mortality and hospital stays, with planned admissions being safer than
unplanned ones. However, post-operative care decisions remain subjective. This
study used artificial intelligence (AI), specifically natural language
processing (NLP) to analyse electronic health records (EHRs) and predict ITU
admissions for elective surgery patients. Methods: This study analysed the EHRs
of elective neurosurgery patients from University College London Hospital
(UCLH) using NLP. Patients were categorised into planned high dependency unit
(HDU) or ITU admission; unplanned HDU or ITU admission; or ward / overnight
recovery (ONR). The Medical Concept Annotation Tool (MedCAT) was used to
identify SNOMED-CT concepts within the clinical notes. We then explored the
utility of these identified concepts for a range of AI algorithms trained to
predict ITU admission. Results: The CogStack-MedCAT NLP model, initially
trained on hospital-wide EHRs, underwent two refinements: first with data from
patients with Normal Pressure Hydrocephalus (NPH) and then with data from
Vestibular Schwannoma (VS) patients, achieving a concept detection F1-score of
0.93. This refined model was then used to extract concepts from EHR notes of
2,268 eligible neurosurgical patients. We integrated the extracted concepts
into AI models, including a decision tree model and a neural time-series model.
Using the simpler decision tree model, we achieved a recall of 0.87 (CI 0.82 -
0.91) for ITU admissions, reducing the proportion of unplanned ITU cases missed
by human experts from 36% to 4%. Conclusion: The NLP model, refined for
accuracy, has proven its efficiency in extracting relevant concepts, providing
a reliable basis for predictive AI models to use in clinically valid
applications.
★ PluralLLM: Pluralistic Alignment in LLMs via Federated Learning
Ensuring Large Language Models (LLMs) align with diverse human preferences
while preserving privacy and fairness remains a challenge. Existing methods,
such as Reinforcement Learning from Human Feedback (RLHF), rely on centralized
data collection, making them computationally expensive and privacy-invasive. We
introduce PluralLLM, a federated learning-based approach that enables multiple
user groups to collaboratively train a transformer-based preference predictor
without sharing sensitive data, which can also serve as a reward model for
aligning LLMs. Our method leverages Federated Averaging (FedAvg) to aggregate
preference updates efficiently, achieving 46% faster convergence, a 4%
improvement in alignment scores, and nearly the same group fairness measure as
in centralized training. Evaluated on a Q/A preference alignment task,
PluralLLM demonstrates that federated preference learning offers a scalable and
privacy-preserving alternative for aligning LLMs with diverse human values.
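The FedAvg aggregation step the abstract relies on is a size-weighted average of client parameters. A minimal numerical sketch (the preference-predictor architecture itself is not reproduced; parameter vectors here are toy placeholders):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated Averaging: size-weighted mean of client parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)   # (n_clients, n_params)
    coeffs = sizes / sizes.sum()         # weight each client by its data share
    return coeffs @ stacked              # aggregated global parameters

# Two user groups with different local preference-predictor parameters;
# the first group contributes three times as much data as the second.
global_w = fedavg([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [3, 1])
# global_w is [0.75, 0.25]
```

Only parameter updates cross the network, so raw preference data never leaves each group, which is the privacy property the paper emphasizes.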
♻ ★ Chain-of-Thought Reasoning In The Wild Is Not Always Faithful ICLR 25
Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art
AI capabilities. However, recent studies have shown that CoT reasoning is not
always faithful, i.e. CoT reasoning does not always reflect how models arrive
at conclusions. So far, most of these studies have focused on unfaithfulness in
unnatural contexts where an explicit bias has been introduced. In contrast, we
show that unfaithful CoT can occur on realistic prompts with no artificial
bias. Our results reveal non-negligible rates of several forms of unfaithful
reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and
ChatGPT-4o (7.0%) all answer a notable proportion of question pairs
unfaithfully. Specifically, we find that models rationalize their implicit
biases in answers to binary questions ("implicit post-hoc rationalization").
For example, when separately presented with the questions "Is X bigger than Y?"
and "Is Y bigger than X?", models sometimes produce superficially coherent
arguments to justify answering Yes to both questions or No to both questions,
despite such responses being logically contradictory. We also investigate
restoration errors (Dziri et al., 2023), where models make and then silently
correct errors in their reasoning, and unfaithful shortcuts, where models use
clearly illogical reasoning to simplify solving problems in Putnam questions (a
hard benchmark). Our findings raise challenges for AI safety work that relies
on monitoring CoT to detect undesired behavior.
comment: Accepted to the Reasoning and Planning for Large Language Models
Workshop (ICLR 25), 10 main paper pages, 38 appendix pages
♻ ★ DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback ICLR 2025
The process of creating training data to teach models is currently driven by
humans, who manually analyze model weaknesses and plan how to create data that
improves a student model. Approaches using LLMs as annotators reduce human
effort, but still require humans to interpret feedback from evaluations and
control the LLM to produce data the student needs. Automating this
labor-intensive process by creating autonomous data generation agents - or
teachers - is desirable, but requires environments that can simulate the
feedback-driven, iterative, closed loop of data creation. To enable rapid,
scalable testing for such agents and their modules, we introduce DataEnvGym, a
testbed of teacher environments for data generation agents. DataEnvGym frames
data generation as a sequential decision-making task, involving an agent
consisting of a data generation policy (which generates a plan for creating
training data) and a data generation engine (which transforms the plan into
data), inside an environment that provides student feedback. The agent's goal
is to improve student performance. Students are iteratively trained and
evaluated on generated data, and their feedback (in the form of errors or weak
skills) is reported to the agent after each iteration. DataEnvGym includes
multiple teacher environment instantiations across 3 levels of structure in the
state representation and action space. More structured environments are based
on inferred skills and offer more interpretability and curriculum control. We
support 4 domains (math, code, VQA, and tool-use) and test multiple students
and teachers. Example agents in our teaching environments can iteratively
improve students across tasks and settings. Moreover, we show that environments
teach different skill levels and test variants of key modules, pointing to
future work in improving data generation agents, engines, and feedback
mechanisms.
comment: ICLR 2025 Spotlight; Project Page: https://DataEnvGym.github.io
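The feedback-driven teacher-student loop that DataEnvGym formalizes can be caricatured in plain Python. All names and update rules below are invented for illustration (a real environment trains an actual student model and evaluates it on real benchmarks):

```python
# Skill ids a hypothetical student can be weak or strong in.
SKILLS = ["algebra", "geometry", "counting"]

def evaluate(student):
    """Return per-skill error rates (the 'student feedback')."""
    return {s: max(0.0, 1.0 - student[s]) for s in SKILLS}

def policy(feedback, budget=10):
    """Data-generation policy: plan more examples for weaker skills."""
    total = sum(feedback.values()) or 1.0
    return {s: round(budget * e / total) for s, e in feedback.items()}

def engine(plan):
    """Data-generation engine: turn the plan into (skill, example) pairs."""
    return [(s, f"{s}-ex{i}") for s, n in plan.items() for i in range(n)]

def train(student, data):
    for skill, _ in data:        # each example nudges that skill upward
        student[skill] = min(1.0, student[skill] + 0.05)
    return student

student = {"algebra": 0.9, "geometry": 0.3, "counting": 0.6}
for _ in range(3):               # iterative closed loop: evaluate -> plan -> train
    feedback = evaluate(student)
    student = train(student, engine(policy(feedback)))
```

The loop's structure (policy produces a plan, engine materializes data, feedback closes the loop) mirrors the agent decomposition in the abstract, just with toy numbers.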
♻ ★ YouTube Comments Decoded: Leveraging LLMs for Low Resource Language Classification
Sarcasm detection is a significant challenge in sentiment analysis,
particularly due to its nature of conveying opinions where the intended meaning
deviates from the literal expression. This challenge is heightened in social
media contexts where code-mixing, especially in Dravidian languages, is
prevalent. Code-mixing involves the blending of multiple languages within a
single utterance, often with non-native scripts, complicating the task for
systems trained on monolingual data. This shared task introduces a novel gold
standard corpus designed for sarcasm and sentiment detection within code-mixed
texts, specifically in Tamil-English and Malayalam-English languages. The
primary objective of this task is to identify sarcasm and sentiment polarity
within a code-mixed dataset of Tamil-English and Malayalam-English comments and
posts collected from social media platforms. Each comment or post is annotated
at the message level for sentiment polarity, with particular attention to the
challenges posed by class imbalance, reflecting real-world scenarios. In this
work, we experiment with state-of-the-art large language models like GPT-3.5
Turbo via prompting to classify comments into sarcastic or non-sarcastic
categories, obtaining macro-F1 scores of 0.61 for Tamil and 0.50 for Malayalam.
comment: Updated and Final Version
♻ ★ Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity ICLR 2025
Architectures such as Linformer and Mamba have recently emerged as
competitive linear time replacements for transformers. However, corresponding
large pretrained models are often unavailable, especially in non-text domains.
To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD)
approach that jointly converts a transformer model to a linear time substitute
and fine-tunes it to a target task. We also compare several means to guide the
fine-tuning to optimally retain the desired inference capability from the
original model. The methods differ in their use of the target model and the
trajectory of the parameters. In a series of empirical studies on language
processing, language modeling, and speech processing, we show that CALD can
effectively recover the result of the original model, and that the choice of
guiding strategy affects the outcome; we suggest some reasons for this
variation.
comment: 18 pages, 5 figures; ICLR 2025 camera ready. Code:
https://github.com/idiap/linearize-distill-pretrained-transformers
♻ ★ Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation ICLR 2025
LLM self-evaluation relies on the LLM's own ability to estimate response
correctness, which can greatly improve its deployment reliability. In this
research track, we propose the Chain-of-Embedding (CoE) in the latent space to
enable LLMs to perform output-free self-evaluation. CoE consists of all
progressive hidden states produced during the inference time, which can be
treated as the latent thinking path of LLMs. We find that CoE features differ
between correct and incorrect responses, and these discrepancies help us
estimate LLM response correctness. Experiments in four diverse
domains and seven LLMs fully demonstrate the effectiveness of our method.
Meanwhile, its label-free design, which requires no training, and its
millisecond-level computational cost ensure real-time feedback in large-scale
scenarios. More importantly, we provide interesting insights into LLM response
correctness from the perspective of hidden state changes inside LLMs.
comment: Accepted by ICLR 2025
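One way to read "progressive hidden states as a latent thinking path" is as a trajectory through layers, from which step-wise features can be computed. The feature set below (step magnitudes and layer-to-layer cosine similarity) is a plausible sketch, not necessarily the paper's exact features:

```python
import numpy as np

def coe_features(hidden_states):
    """Chain-of-Embedding-style features from per-layer hidden states.

    hidden_states: (n_layers, dim) array of one token position's states
    across layers. Returns the step sizes and direction changes along
    the latent path.
    """
    diffs = np.diff(hidden_states, axis=0)       # layer-to-layer steps
    magnitudes = np.linalg.norm(diffs, axis=1)   # how far each step moves
    a, b = hidden_states[:-1], hidden_states[1:]
    cosines = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return magnitudes, cosines

rng = np.random.default_rng(1)
states = rng.normal(size=(12, 64))   # e.g. 12 layers of a 64-dim toy model
mags, coss = coe_features(states)
```

A lightweight classifier over such features could then separate correct from incorrect responses without ever decoding output tokens, which is the "output-free" property.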
♻ ★ When Text Embedding Meets Large Language Model: A Comprehensive Survey
Text embedding has become a foundational technology in natural language
processing (NLP) during the deep learning era, driving advancements across a
wide array of downstream tasks. While many natural language understanding
challenges can now be modeled using generative paradigms and leverage the
robust generative and comprehension capabilities of large language models
(LLMs), numerous practical applications-such as semantic matching, clustering,
and information retrieval-continue to rely on text embeddings for their
efficiency and effectiveness. Therefore, how to combine LLMs and text
embeddings has become a major focus of academic attention in recent
years. In this survey, we categorize the interplay between LLMs and text
embeddings into three overarching themes: (1) LLM-augmented text embedding,
enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders,
adapting their innate capabilities for high-quality embedding; and (3) Text
embedding understanding with LLMs, leveraging LLMs to analyze and interpret
embeddings. By organizing recent works based on interaction patterns rather
than specific downstream applications, we offer a novel and systematic overview
of contributions from various research and application domains in the era of
LLMs. Furthermore, we highlight the unresolved challenges that persisted in the
pre-LLM era with pre-trained language models (PLMs) and explore the emerging
obstacles brought forth by LLMs. Building on this analysis, we outline
prospective directions for the evolution of text embedding, addressing both
theoretical and practical opportunities in the rapidly advancing landscape of
NLP.
comment: Work in progress
♻ ★ InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Advanced reasoning in large language models has achieved remarkable
performance on challenging tasks, but the prevailing long-context reasoning
paradigm faces critical limitations: quadratic computational scaling with
sequence length, reasoning constrained by maximum context boundaries, and
performance degradation beyond pre-training context windows. Existing
approaches primarily compress reasoning chains without addressing the
fundamental scaling problem. To overcome these challenges, we introduce
InftyThink, a paradigm that transforms monolithic reasoning into an iterative
process with intermediate summarization. By interleaving short reasoning
segments with concise progress summaries, our approach enables unbounded
reasoning depth while maintaining bounded computational costs. This creates a
characteristic sawtooth memory pattern that significantly reduces computational
complexity compared to traditional approaches. Furthermore, we develop a
methodology for reconstructing long-context reasoning datasets into our
iterative format, transforming OpenR1-Math into 333K training instances.
Experiments across multiple model architectures demonstrate that our approach
reduces computational costs while improving performance, with Qwen2.5-Math-7B
showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks.
Our work challenges the assumed trade-off between reasoning depth and
computational efficiency, providing a more scalable approach to complex
reasoning without architectural modifications.
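The iterative reason-then-summarize control flow that replaces monolithic long-context reasoning can be sketched as a loop over bounded segments. The `reason`, `summarize`, and `is_final` callables stand in for LLM calls and are assumptions for illustration:

```python
def inftythink_loop(question, reason, summarize, is_final, max_rounds=8):
    """Iterative reasoning with intermediate summarization (sketch).

    Each round the model sees only the question plus a bounded summary,
    never the full reasoning chain, so per-round context length stays
    constant -- the source of the sawtooth memory pattern.
    """
    summary = ""
    for _ in range(max_rounds):
        segment = reason(question, summary)    # short bounded reasoning segment
        if is_final(segment):
            return segment, summary
        summary = summarize(summary, segment)  # compress progress so far
    return None, summary

# Toy stand-ins: count upward in steps of 2, finishing once we reach >= 5.
ans, _ = inftythink_loop(
    "count",
    reason=lambda q, s: int(s or 0) + 2,
    summarize=lambda s, seg: str(seg),
    is_final=lambda seg: seg >= 5,
)
```

Because each call's cost depends only on the segment and summary lengths, total cost grows linearly in the number of rounds rather than quadratically in total reasoning length.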
♻ ★ DataMan: Data Manager for Pre-training Large Language Models ICLR2025
The performance emergence of large language models (LLMs) driven by data
scaling laws makes the selection of pre-training data increasingly important.
However, existing methods rely on limited heuristics and human intuition,
lacking comprehensive and clear guidelines. To address this, we are inspired by
``reverse thinking'' -- prompting LLMs to self-identify which criteria benefit
their performance. As pre-training capability is related to perplexity
(PPL), we derive 14 quality criteria from the causes of text perplexity
anomalies and introduce 15 common application domains to support domain mixing.
In this paper, we train a Data Manager (DataMan) to learn quality ratings and
domain recognition from pointwise ratings, and use it to annotate a 447B-token
pre-training corpus with the 14 quality ratings and a domain type. Our experiments
validate our approach, using DataMan to select 30B tokens to train a
1.3B-parameter language model, demonstrating significant improvements in
in-context learning (ICL), perplexity, and instruction-following ability over
the state-of-the-art baseline. The best-performing model, based on the Overall
Score l=5, surpasses a model trained with 50% more data using uniform sampling.
We continue pre-training with high-rated, domain-specific data annotated by
DataMan to enhance domain-specific ICL performance and thus verify DataMan's
domain mixing ability. Our findings emphasize the importance of quality
ranking, the complementary nature of quality criteria, and their low
correlation with perplexity, analyzing misalignment between PPL and ICL
performance. We also thoroughly analyzed our pre-training dataset, examining
its composition, the distribution of quality ratings, and the original document
sources.
comment: ICLR2025 paper
♻ ★ MastermindEval: A Simple But Scalable Reasoning Benchmark ICLR 2025
Recent advancements in large language models (LLMs) have led to remarkable
performance across a wide range of language understanding and mathematical
tasks. As a result, increasing attention has been given to assessing the true
reasoning capabilities of LLMs, driving research into commonsense, numerical,
logical, and qualitative reasoning. However, with the rapid progress of
reasoning-focused models such as OpenAI's o1 and DeepSeek's R1, there has been
a growing demand for reasoning benchmarks that can keep pace with ongoing model
developments. In this paper, we introduce MastermindEval, a simple, scalable,
and interpretable deductive reasoning benchmark inspired by the board game
Mastermind. Our benchmark supports two evaluation paradigms: (1) agentic
evaluation, in which the model autonomously plays the game, and (2) deductive
reasoning evaluation, in which the model is given a pre-played game state with
only one possible valid code to infer. In our experimental results we (1) find
that even easy Mastermind instances are difficult for current models and (2)
demonstrate that the benchmark is scalable to possibly more advanced models in
the future. Furthermore, we investigate possible reasons why models cannot
deduce the final solution and find that current models are limited in deducing
the concealed code as the number of statements from which information must be
combined increases.
comment: 9 pages, 2 figures, 4 tables. In: ICLR 2025 Workshop on Reasoning and
Planning for Large Language Models
♻ ★ EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions CVPR 2025
Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu
GPT-4o, an omni-modal model that enables vocal conversations with diverse
emotions and tones, marks a milestone for omni-modal foundation models.
However, empowering Large Language Models to perceive and generate images,
texts, and speeches end-to-end with publicly available data remains challenging
for the open-source community. Existing vision-language models rely on external
tools for speech processing, while speech-language models still have limited
vision-understanding capabilities or lack them entirely. To address this
gap, we propose EMOVA (EMotionally Omni-present Voice Assistant) to enable
Large Language Models with end-to-end speech abilities while maintaining the
leading vision-language performance. With a semantic-acoustic disentangled
speech tokenizer, we surprisingly notice that omni-modal alignment can further
enhance vision-language and speech abilities compared with the bi-modal aligned
counterparts. Moreover, a lightweight style module is introduced for the
flexible speech style controls including emotions and pitches. For the first
time, EMOVA achieves state-of-the-art performance on both the vision-language
and speech benchmarks while supporting omni-modal spoken dialogue
with vivid emotions.
comment: Accepted by CVPR 2025. Project Page: https://emova-ollm.github.io/
♻ ★ MIX : a Multi-task Learning Approach to Solve Open-Domain Question Answering
This paper introduces MIX, a multi-task deep learning approach to solve
open-ended question-answering. First, we design our system as a multi-stage
pipeline of 3 building blocks: a BM25-based Retriever to reduce the search
space, a RoBERTa-based Scorer, and an Extractor to rank retrieved paragraphs
and extract relevant text spans, respectively. Eventually, we further improve
the computational efficiency of our system to deal with the scalability
challenge: thanks to multi-task learning, we parallelize the closely related
tasks solved by the Scorer and the Extractor. Our system is on par with
state-of-the-art performance on the SQuAD-Open benchmark while being
conceptually simpler.
comment: 8 pages, 7 figures, 3 tables
♻ ★ PAD: Personalized Alignment of LLMs at Decoding-Time ICLR 2025
Aligning with personalized preferences, which vary significantly across
cultural, educational, and political differences, poses a significant challenge
due to the computational costs and data demands of traditional alignment
methods. In response, this paper presents Personalized Alignment at
Decoding-time (PAD), a novel framework designed to align LLM outputs with
diverse personalized preferences during the inference phase, eliminating the
need for additional training. By introducing a unique personalized reward
modeling strategy, this framework decouples the text generation process from
personalized preferences, facilitating the generation of generalizable
token-level personalized rewards. The PAD algorithm leverages these rewards to
guide the decoding process, dynamically tailoring the base model's predictions
to personalized preferences. Extensive experimental results demonstrate that
PAD not only outperforms existing training-based alignment methods in terms of
aligning with diverse preferences but also shows significant generalizability
to preferences unseen during training and scalability across different base
models. This work advances the capability of LLMs to meet user needs in
real-time applications, presenting a substantial step forward in personalized
LLM alignment.
comment: ICLR 2025
♻ ★ Diabetica: Adapting Large Language Model to Enhance Multiple Medical Tasks in Diabetes Care and Management ICLR 2025
Lai Wei, Zhen Ying, Muyang He, Yutong Chen, Qian Yang, Yanzhe Hong, Jiaping Lu, Kaipeng Zheng, Shaoting Zhang, Xiaoying Li, Weiran Huang, Ying Chen
Diabetes is a chronic disease with a significant global health burden,
requiring multi-stakeholder collaboration for optimal management. Large
language models (LLMs) have shown promise in various healthcare scenarios, but
their effectiveness across diverse diabetes tasks remains unproven. Our study
introduced a framework to train and validate diabetes-specific LLMs. We first
developed a comprehensive data processing pipeline that includes data
collection, filtering, augmentation and refinement. This created a
high-quality, diabetes-specific dataset and evaluation benchmarks from scratch.
Fine-tuned on the collected training dataset, our diabetes-specific LLM family
demonstrated state-of-the-art proficiency in processing various diabetes tasks
compared to other LLMs. Furthermore, clinical studies revealed the potential
applications of our models in diabetes care, including providing personalized
healthcare, assisting medical education, and streamlining clinical tasks.
Generally, our introduced framework helps develop diabetes-specific LLMs and
highlights their potential to enhance clinical practice and provide
personalized, data-driven support for diabetes management across different end
users. Our codes, benchmarks and models are available at
https://github.com/waltonfuture/Diabetica.
comment: Accepted by ICLR 2025 SCI-FM workshop
♻ ★ Adapting Multilingual Embedding Models to Historical Luxembourgish
The growing volume of digitized historical texts requires effective semantic
search using text embeddings. However, pre-trained multilingual models face
challenges with historical content due to OCR noise and outdated spellings.
This study examines multilingual embeddings for cross-lingual semantic search
in historical Luxembourgish (LB), a low-resource language. We collect
historical Luxembourgish news articles from various periods and use GPT-4o for
sentence segmentation and translation, generating 20,000 parallel training
sentences per language pair. Additionally, we create a semantic search
(Historical LB Bitext Mining) evaluation set and find that existing models
perform poorly on cross-lingual search for historical Luxembourgish. Using our
historical and additional modern parallel training data, we adapt several
multilingual embedding models through contrastive learning or knowledge
distillation and increase accuracy significantly for all models. We release our
adapted models and historical Luxembourgish-German/French/English bitexts to
support further research.
comment: To appear in LaTeCH-CLfL 2025
♻ ★ FIND: Fine-grained Information Density Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis
Retrieval-Augmented Large Language Models (LLMs), which integrate external
knowledge into LLMs, have shown remarkable performance in various medical
domains, including clinical diagnosis. However, existing RAG methods struggle
to effectively assess task difficulty to make retrieval decisions, thereby
failing to meet the clinical requirements for balancing efficiency and
accuracy. In this paper, we therefore propose FIND (\textbf{F}ine-grained
\textbf{In}formation \textbf{D}ensity Guided Adaptive RAG), a novel framework
that improves the reliability of RAG in disease diagnosis scenarios. FIND
incorporates a fine-grained adaptive control module to determine whether
retrieval is necessary based on the information density of the input. By
optimizing the retrieval process and implementing a knowledge filtering module,
FIND ensures that the retrieval is better suited to clinical scenarios.
Experiments on three Chinese electronic medical record datasets demonstrate
that FIND significantly outperforms various baseline methods, highlighting its
effectiveness in clinical diagnosis tasks.
♻ ★ Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing
Knowledge tracing (KT) is a popular approach for modeling students' learning
progress over time, which can enable more personalized and adaptive learning.
However, existing KT approaches face two major limitations: (1) they rely
heavily on expert-defined knowledge concepts (KCs) in questions, which is
time-consuming and prone to errors; and (2) KT methods tend to overlook the
semantics of both questions and the given KCs. In this work, we address these
challenges and present KCQRL, a framework for automated knowledge concept
annotation and question representation learning that can improve the
effectiveness of any existing KT model. First, we propose an automated KC
annotation process using large language models (LLMs), which generates question
solutions and then annotates KCs in each solution step of the questions.
Second, we introduce a contrastive learning approach to generate semantically
rich embeddings for questions and solution steps, aligning them with their
associated KCs via a tailored false negative elimination approach. These
embeddings can be readily integrated into existing KT models, replacing their
randomly initialized embeddings. We demonstrate the effectiveness of KCQRL
across 15 KT algorithms on two large real-world Math learning datasets, where
we achieve consistent performance improvements.
♻ ★ Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs
This work adapts and studies the gradient-based Membership Inference Test
(gMINT) to the classification of text based on LLMs. MINT is a general approach
intended to determine if given data was used for training machine learning
models, and this work focuses on its application to the domain of Natural
Language Processing. Using gradient-based analysis, the MINT model identifies
whether particular data samples were included during the language model
training phase, addressing growing concerns about data privacy in machine
learning. The method was evaluated in seven Transformer-based models and six
datasets comprising over 2.5 million sentences, focusing on text classification
tasks. Experimental results demonstrate MINT's robustness, achieving AUC scores
between 85% and 99%, depending on data size and model architecture. These
findings highlight MINT's potential as a scalable and reliable tool for auditing
machine learning models, ensuring transparency, safeguarding sensitive data,
and fostering ethical compliance in the deployment of AI/NLP technologies.
♻ ★ DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
Minxuan Lv, Zhenpeng Su, Leiyu Pan, Yizhe Xiong, Zijia Lin, Hui Chen, Wei Zhou, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Songlin Hu
As large language models continue to scale, computational costs and resource
consumption have emerged as significant challenges. While existing
sparsification methods like pruning reduce computational overhead, they risk
losing model knowledge through parameter removal. This paper proposes DSMoE
(Dynamic Sparse Mixture-of-Experts), a novel approach that achieves
sparsification by partitioning pre-trained FFN layers into computational
blocks. We implement adaptive expert routing using sigmoid activation and
straight-through estimators, enabling tokens to flexibly access different
aspects of model knowledge based on input complexity. Additionally, we
introduce a sparsity loss term to balance performance and computational
efficiency. Extensive experiments on LLaMA models demonstrate that under
equivalent computational constraints, DSMoE achieves superior performance
compared to existing pruning and MoE approaches across language modeling and
downstream tasks, particularly excelling in generation tasks. Analysis reveals
that DSMoE learns distinctive layerwise activation patterns, providing new
insights for future MoE architecture design.
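A forward pass of the sigmoid-gated expert routing DSMoE describes can be sketched as follows. The names (`expert_weights`, `router_w`, `tau`) and shapes are illustrative assumptions; only inference is shown, with the hard 0/1 mask marking where a straight-through estimator would pass gradients during training:

```python
import numpy as np

def dsmoe_forward(x, expert_weights, router_w, tau=0.5):
    """Sigmoid-gated mixture of FFN partitions (illustrative sketch).

    Each 'expert' plays the role of one partition of a pre-trained FFN
    weight matrix. Tokens activate only the experts whose gate exceeds
    tau, so compute scales with input complexity rather than model size.
    """
    logits = x @ router_w                      # (n_experts,) routing logits
    gates = 1.0 / (1.0 + np.exp(-logits))      # per-expert sigmoid activation
    mask = (gates > tau).astype(float)         # hard selection (STE in backward)
    outputs = np.stack([w @ x for w in expert_weights])  # (n_experts, d_out)
    return (mask * gates) @ outputs            # gated sum over active experts

rng = np.random.default_rng(2)
d, n_exp = 8, 4
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
router = rng.normal(size=(d, n_exp))
y = dsmoe_forward(x, experts, router)
```

Unlike top-k softmax routing, independent sigmoid gates let any number of experts fire per token, which is what makes the sparsity level input-dependent.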
♻ ★ Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets
Tohida Rehman, Soumabha Ghosh, Kuntal Das, Souvik Bhattacharjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Text summarization plays a crucial role in natural language processing by
condensing large volumes of text into concise and coherent summaries. As
digital content continues to grow rapidly and the demand for effective
information retrieval increases, text summarization has become a focal point of
research in recent years. This study offers a thorough evaluation of four
leading pre-trained and open-source large language models: BART, FLAN-T5,
LLaMA-3-8B, and Gemma-7B, across five diverse datasets: CNN/DM, Gigaword, News
Summary, XSum, and BBC News. The evaluation employs widely recognized automatic
metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess
the models' capabilities in generating coherent and informative summaries. The
results reveal the comparative strengths and limitations of these models in
processing various text types.
comment: 5 pages, 2 figures, 6 tables
♻ ★ Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL from users' natural language questions (text-to-SQL)
remains a long-standing challenge due to the complexities involved in user
question understanding, database schema comprehension, and SQL generation.
Traditional text-to-SQL systems, which combine human engineering and deep
neural networks, have made significant progress. Subsequently, pre-trained
language models (PLMs) have been developed for text-to-SQL tasks, achieving
promising results. However, as modern databases and user questions grow more
complex, PLMs with a limited parameter size often produce incorrect SQL. This
necessitates more sophisticated and tailored optimization methods, which
restricts the application of PLM-based systems. Recently, large language models
(LLMs) have shown significant capabilities in natural language understanding as
model scale increases. Thus, integrating LLM-based solutions can bring unique
opportunities, improvements, and solutions to text-to-SQL research. In this
survey, we provide a comprehensive review of existing LLM-based text-to-SQL
studies. Specifically, we offer a brief overview of the technical challenges
and evolutionary process of text-to-SQL. Next, we introduce the datasets and
metrics designed to evaluate text-to-SQL systems. Subsequently, we present a
systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we
summarize and discuss the remaining challenges in this field and
suggest expectations for future research directions.
♻ ★ Computational Law: Datasets, Benchmarks, and Ontologies
Recent developments in computer science and artificial intelligence have also
contributed to the legal domain, as revealed by the number and range of related
publications and applications. Machine and deep learning models require a
considerable amount of domain-specific data for training and comparison
purposes in order to attain high performance in the legal domain.
Additionally, semantic resources such as ontologies are valuable for building
large-scale computational legal systems, in addition to ensuring
interoperability of such systems. Considering these aspects, we present an
up-to-date review of the literature on datasets, benchmarks, and ontologies
proposed for computational law. We believe that this comprehensive and recent
review will help researchers and practitioners when developing and testing
approaches and systems for computational law.
♻ ★ TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
In the domain of complex reasoning tasks, such as mathematical reasoning,
recent advancements have proposed the use of Direct Preference Optimization
(DPO) to suppress the output of dispreferred responses, thereby enhancing the
long-chain reasoning capabilities of large language models (LLMs). To this end,
these studies employed LLMs to generate preference trees via Tree-of-thoughts
(ToT) and sample the paired preference responses required by the DPO algorithm.
However, the DPO algorithm, based on binary preference optimization, cannot
learn from multiple responses with varying degrees of preference or
dispreference provided by the preference trees, resulting in incomplete
preference learning. In this work, we introduce Tree Preference Optimization
(TPO), which does not sample paired preference responses from the preference
tree; instead, it directly learns from the entire preference tree during
fine-tuning.
Specifically, TPO formulates the language model alignment as a Preference List
Ranking problem, where the policy can potentially learn more effectively from a
ranked preference list of responses given the prompt. In addition, to further
assist LLMs in identifying discriminative steps within long-chain reasoning and
increase the relative reward margin in the preference list, TPO utilizes
Adaptive Step Reward to adjust the reward value of each step in the trajectory for
performing fine-grained preference optimization. We carry out extensive
experiments on mathematical reasoning tasks to evaluate TPO. The experimental
results indicate that TPO consistently outperforms DPO across five public large
language models on four datasets.
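The Preference List Ranking formulation can be illustrated with a generic listwise (Plackett-Luce-style) objective over policy scores sorted from most to least preferred; this is a common listwise loss shown for illustration, not TPO's exact objective:

```python
import numpy as np

def listwise_ranking_loss(scores):
    """Plackett-Luce negative log-likelihood of a response list whose
    scores are ordered from most- to least-preferred. The loss is lower
    when the model's scores agree with the preference ranking."""
    scores = np.asarray(scores, dtype=float)
    loss = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]
        m = tail.max()
        lse = m + np.log(np.exp(tail - m).sum())  # stable log-sum-exp
        loss -= tail[0] - lse                     # -log softmax of top item
    return loss
```

Scores consistent with the ranking (e.g. `[3, 2, 1]`) incur a lower loss than inverted ones, which is the graded listwise signal a pairwise DPO objective cannot capture in full.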
♻ ★ Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks ECCV 2024
Recent vision-language foundation models, such as CLIP, have demonstrated
superior capabilities in learning representations that are transferable
across a diverse range of downstream tasks and domains. With the emergence of
such powerful models, it has become crucial to effectively leverage their
capabilities in tackling challenging vision tasks. On the other hand, only a
few works have focused on devising adversarial examples that transfer well to
both unknown domains and model architectures. In this paper, we propose a novel
transfer attack method called PDCL-Attack, which leverages the CLIP model to
enhance the transferability of adversarial perturbations generated by a
generative model-based attack framework. Specifically, we formulate an
effective prompt-driven feature guidance by harnessing the semantic
representation power of text, particularly from the ground-truth class labels
of input images. To the best of our knowledge, we are the first to introduce
prompt learning to enhance transferable generative attacks. Extensive
experiments conducted across various cross-domain and cross-model settings
empirically validate our approach, demonstrating its superiority over
state-of-the-art methods.
comment: Accepted to ECCV 2024 (Oral), Project Page:
https://PDCL-Attack.github.io
♻ ★ Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training CVPR 2025
In this paper, we focus on monolithic Multimodal Large Language Models
(MLLMs) that integrate visual encoding and language decoding into a single LLM.
In particular, we identify that existing pre-training strategies for monolithic
MLLMs often suffer from unstable optimization or catastrophic forgetting. To
address this issue, our core idea is to embed a new visual parameter space into
a pre-trained LLM, thereby stably learning visual knowledge from noisy data
while freezing the LLM. Based on this principle, we present Mono-InternVL, a
novel monolithic MLLM that seamlessly integrates a set of visual experts via a
multimodal mixture-of-experts structure. Moreover, we propose an innovative
pre-training strategy to maximize the visual capability of Mono-InternVL,
namely Endogenous Visual Pre-training (EViP). In particular, EViP is designed
as a progressive learning process for visual experts, which aims to fully
exploit the visual knowledge from noisy data to high-quality data. To validate
our approach, we conduct extensive experiments on 16 benchmarks. Experimental
results confirm the superior performance of Mono-InternVL over existing
monolithic MLLMs on 13 of 16 multimodal benchmarks, e.g., +80 points over Emu3
on OCRBench. Compared to the modular baseline, i.e., InternVL-1.5,
Mono-InternVL still retains comparable multimodal performance while reducing up
to 67% first token latency. Code and model are released at
https://github.com/OpenGVLab/Mono-InternVL.
comment: Accepted by CVPR 2025
♻ ★ Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Recent advancements in reasoning with large language models (RLLMs), such as
OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in
complex domains like mathematics and coding. A central factor in their success
lies in the application of long chain-of-thought (Long CoT) characteristics,
which enhance reasoning abilities and enable the solution of intricate
problems. However, despite these developments, a comprehensive survey on Long
CoT is still lacking, limiting our understanding of its distinctions from
traditional short chain-of-thought (Short CoT) and complicating ongoing debates
on issues like "overthinking" and "test-time scaling." This survey seeks to
fill this gap by offering a unified perspective on Long CoT. (1) We first
distinguish Long CoT from Short CoT and introduce a novel taxonomy to
categorize current reasoning paradigms. (2) Next, we explore the key
characteristics of Long CoT: deep reasoning, extensive exploration, and
feasible reflection, which enable models to handle more complex tasks and
produce more efficient, coherent outcomes compared to the shallower Short CoT.
(3) We then investigate key phenomena such as the emergence of Long CoT,
overthinking, and test-time scaling, offering
insights into how these processes manifest in practice. (4) Finally, we
identify significant research gaps and highlight promising future directions,
including the integration of multi-modal reasoning, efficiency improvements,
and enhanced knowledge frameworks. By providing a structured overview, this
survey aims to inspire future research and further the development of logical
reasoning in artificial intelligence.
comment: Papers are available at https://long-cot.github.io/
♻ ★ MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference NAACL 2025
Long-context Multimodal Large Language Models (MLLMs) that incorporate long
text-image and text-video modalities demand substantial resources, as their
multimodal Key-Value (KV) caches grow with increasing input lengths,
challenging inference efficiency. Existing methods for KV cache compression, in
both text-only and multimodal LLMs, have neglected attention density variations
across layers, thus often adopting uniform or progressive reduction strategies
for layer-wise cache allocation. In this work, we propose MEDA, a dynamic
layer-wise KV cache allocation method for efficient multimodal long-context
inference. At its core, MEDA utilizes cross-modal attention entropy to
determine the KV cache size at each MLLM layer. Given the dynamically
allocated KV cache size at each layer, MEDA also employs a KV pair selection
scheme to identify which KV pairs to select and a KV pair merging strategy that
merges the selected and non-selected ones to preserve information from the
entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82
times faster decoding speed, while maintaining or enhancing performance on
various multimodal tasks in long-context settings, including multi-images and
long-video scenarios. Our code is released at
https://github.com/AIoT-MLSys-Lab/MEDA.
comment: NAACL 2025 Main
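The entropy-driven allocation behind MEDA can be caricatured as follows; the proportional rule and names are simplifying assumptions for illustration, since the actual cross-modal entropy computation is more involved:

```python
import numpy as np

def allocate_kv_budget(attn_probs_per_layer, total_budget):
    """Split a total KV-cache budget across layers in proportion to mean
    attention entropy: layers with flatter (higher-entropy) attention keep
    more KV entries. Illustrative sketch only."""
    entropies = []
    for p in attn_probs_per_layer:        # p: (heads, queries, keys) probs
        p = np.clip(p, 1e-12, 1.0)
        entropies.append(-(p * np.log(p)).sum(axis=-1).mean())
    e = np.array(entropies)
    budgets = np.floor(total_budget * e / e.sum()).astype(int)
    budgets[np.argmax(e)] += total_budget - budgets.sum()  # absorb rounding
    return budgets
```

A layer with near-uniform attention over its keys would receive most of the budget, while a layer whose attention is sharply peaked can be compressed aggressively.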
♻ ★ Multi-agent KTO: Reinforcing Strategic Interactions of Large Language Model in Language Game
Achieving Artificial General Intelligence (AGI) requires AI agents that can
not only make strategic decisions but also engage in flexible and meaningful
communication. Inspired by Wittgenstein's language game theory in Philosophical
Investigations, we propose that language agents can learn through in-context
interaction rather than traditional multi-stage frameworks that separate
decision-making from language expression. Using Werewolf, a social deduction
game that tests language understanding, strategic interaction, and
adaptability, we develop Multi-agent Kahneman & Tversky's Optimization
(MaKTO). MaKTO engages diverse models in extensive gameplay to generate
unpaired desirable and unacceptable responses, then employs KTO to refine the
model's decision-making process. In 9-player Werewolf games, MaKTO achieves a
61% average win rate across various models, outperforming GPT-4o and two-stage
RL agents by relative improvements of 23.0% and 10.9%, respectively. Notably,
MaKTO also demonstrates human-like performance, winning 60% against expert
players and showing only 49% detectability in Turing-style blind tests.
comment: Preprint. Code and data will be available at
https://reneeye.github.io/MaKTO.html
♻ ★ Punctuation restoration improves structure understanding without supervision RepL4NLP 2025
Unsupervised learning objectives like autoregressive and masked language
modeling constitute a significant part of producing pre-trained representations
that support various downstream applications, from natural language
understanding to conversational tasks. However, despite the impressive generative
capabilities of recent large language models, their abilities to capture
syntactic or semantic structure within text lag behind. We hypothesize that the
mismatch between linguistic performance and competence in machines is
attributable to insufficient learning of linguistic structure knowledge via
currently popular pre-training objectives. Working with English, we show that
punctuation restoration as a learning objective improves performance on
structure-related tasks like named entity recognition, open information
extraction, chunking, and part-of-speech tagging. Punctuation restoration
yields an improvement of $\geq 2$ percentage points in 16 out of 18
experiments, across 6 out of 7 tasks. Our results show that punctuation restoration is an
effective learning objective that can improve structure understanding and yield
more robust, structure-aware representations of natural language in base-sized
models.
comment: 11 pages, 1 figure, 6 tables. RepL4NLP 2025
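Constructing training data for such an objective is straightforward; a minimal sketch, assuming a simple strip-and-restore pair format rather than the paper's exact preprocessing:

```python
import re

def make_punct_restoration_example(text):
    """Turn a sentence into an (input, target) pair: the model receives the
    lower-cased, de-punctuated text and must restore the original."""
    stripped = re.sub(r"[^\w\s]", "", text)           # drop punctuation
    stripped = re.sub(r"\s+", " ", stripped).strip()  # normalize whitespace
    return stripped.lower(), text
```

Because the target is just the raw text, this objective needs no human labels, which is what makes it usable as an unsupervised pre-training signal.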
♻ ★ SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
Sparse autoencoders (SAEs) are a popular technique for interpreting language
model activations, and there is extensive recent work on improving SAE
effectiveness. However, most prior work evaluates progress using unsupervised
proxy metrics with unclear practical relevance. We introduce SAEBench, a
comprehensive evaluation suite that measures SAE performance across seven
diverse metrics, spanning interpretability, feature disentanglement and
practical applications like unlearning. To enable systematic comparison, we
open-source a suite of over 200 SAEs across eight recently proposed SAE
architectures and training algorithms. Our evaluation reveals that gains on
proxy metrics do not reliably translate to better practical performance. For
instance, while Matryoshka SAEs slightly underperform on existing proxy
metrics, they substantially outperform other architectures on feature
disentanglement metrics; moreover, this advantage grows with SAE scale. By
providing a standardized framework for measuring progress in SAE development,
SAEBench enables researchers to study scaling trends and make nuanced
comparisons between different SAE architectures and training methodologies. Our
interactive interface enables researchers to flexibly visualize relationships
between metrics across hundreds of open-source SAEs at: https://saebench.xyz
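For readers unfamiliar with the object under evaluation, a minimal ReLU sparse autoencoder of the kind SAEBench benchmarks looks roughly like this; shapes, the L1 coefficient, and names are illustrative assumptions:

```python
import numpy as np

def sae_forward(acts, W_enc, b_enc, W_dec, b_dec, l1_coef=1e-3):
    """Encode model activations into a wider, sparse feature basis and
    reconstruct them; the loss trades reconstruction error against an
    L1 sparsity penalty on the features."""
    feats = np.maximum(acts @ W_enc + b_enc, 0.0)  # sparse features (ReLU)
    recon = feats @ W_dec + b_dec                  # reconstruction
    l2 = ((acts - recon) ** 2).sum(axis=-1).mean()
    l1 = np.abs(feats).sum(axis=-1).mean()
    return recon, feats, l2 + l1_coef * l1
```

The unsupervised proxy metrics the abstract criticizes typically score exactly these two terms (reconstruction and sparsity), which is why they can diverge from downstream utility.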
♻ ★ D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models ICLR 2025
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, Mi Zhang
Generative inference in Large Language Models (LLMs) is impeded by the
growing memory demands of Key-Value (KV) cache, especially for longer
sequences. Traditional KV cache eviction strategies, which discard less
critical KV pairs based on attention scores, often degrade generation quality,
leading to issues such as context loss or hallucinations. In this work, we
introduce Dynamic Discriminative Operations (D2O), a KV cache compression
method that optimizes KV cache size dynamically and discriminatively at two
levels without fine-tuning, while preserving essential context. At the layer level,
D2O leverages the varying densities of attention weights between shallow and
deep layers to dynamically determine which layers should avoid excessive
eviction via a novel dynamic allocation strategy to minimize information loss.
At the token level, D2O incorporates a compensation mechanism that maintains a
similarity threshold to re-discriminate the importance of currently discarded
tokens, determining whether they should be recalled and merged with similar
tokens. We conduct experiments on various benchmarks and LLM architectures. Our
results show that D2O not only achieves significant memory savings and enhances
inference throughput by more than 3$\times$ but also maintains high-quality
long-text generation.
comment: ICLR 2025
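The token-level compensation step, recalling a discarded KV pair by merging it into its most similar kept neighbor, can be sketched as below; the averaging rule and the fixed threshold are simplifying assumptions:

```python
import numpy as np

def merge_discarded(kept_k, kept_v, drop_k, drop_v, tau=0.9):
    """For each discarded key, find the most cosine-similar kept key; if the
    similarity clears the threshold `tau`, fold the pair in by averaging
    instead of losing its information outright. Mutates kept_k/kept_v."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)
    sims = unit(drop_k) @ unit(kept_k).T   # (n_dropped, n_kept)
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= tau:                  # similar enough: recall and merge
            kept_k[j] = 0.5 * (kept_k[j] + drop_k[i])
            kept_v[j] = 0.5 * (kept_v[j] + drop_v[i])
    return kept_k, kept_v
```

Dissimilar discarded tokens are simply evicted, so the cache stays within budget while near-duplicates of retained context are not lost.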
♻ ★ Grounding Natural Language to SQL Translation with Data-Based Self-Explanations ICDE2025
Natural Language Interfaces for Databases empower non-technical users to
interact with data using natural language (NL). Advanced approaches, utilizing
either neural sequence-to-sequence or more recent sophisticated large-scale
language models, typically implement NL to SQL (NL2SQL) translation in an
end-to-end fashion. However, like humans, these end-to-end translation models
may not always generate the best SQL output on their first try. In this paper,
we propose CycleSQL, an iterative framework designed for end-to-end translation
models to autonomously generate the best output through self-evaluation. The
main idea of CycleSQL is to introduce data-grounded NL explanations of query
results as self-provided feedback, and use the feedback to validate the
correctness of the translation iteratively, hence improving the overall
translation accuracy. Extensive experiments, including quantitative and
qualitative evaluations, are conducted to study CycleSQL by applying it to
seven existing translation models on five widely used benchmarks. The results
show that 1) the feedback loop introduced in CycleSQL can consistently improve
the performance of existing models; in particular, applying CycleSQL to
RESDSQL yields a translation accuracy of 82.0% (+2.6%) on the validation set
and 81.6% (+3.2%) on the test set of the Spider benchmark; 2) the generated NL
explanations can also provide insightful information for users, aiding in the
comprehension of translation results and consequently enhancing the
interpretability of NL2SQL translation.
comment: ICDE2025
♻ ★ Enhancing Chain of Thought Prompting in Large Language Models via Reasoning Patterns
Chain of Thought (CoT) prompting can encourage language models to engage in
multi-step logical reasoning. The quality of the provided demonstrations
significantly influences the success of downstream inference tasks. Current
unsupervised CoT methods primarily select examples based on the semantics of
the questions, which can introduce noise and lack interpretability. In this
paper, we propose leveraging reasoning patterns to enhance CoT prompting
effectiveness. Reasoning patterns represent the process by which language
models arrive at their final results. By utilizing prior knowledge and
prompt-based methods from large models, we first construct task-specific
pattern sets. We then select diverse demonstrations based on different
reasoning patterns. This approach not only mitigates the impact of noise but
also provides explicit interpretability to help us understand the mechanisms of
CoT. Extensive experiments demonstrate that our method is more robust and
consistently leads to improvements across various reasoning tasks.
♻ ★ Preference Alignment for Diffusion Model via Explicit Denoised Distribution Estimation
Diffusion models have shown remarkable success in text-to-image generation,
making preference alignment for these models increasingly important. The
preference labels are typically available only at the terminal of denoising
trajectories, which poses challenges in optimizing the intermediate denoising
steps. In this paper, we propose to conduct Denoised Distribution Estimation
(DDE) that explicitly connects intermediate steps to the terminal denoised
distribution. Therefore, preference labels can be used for the entire
trajectory optimization. To this end, we design two estimation strategies for
our DDE. The first is stepwise estimation, which utilizes the conditional
denoised distribution to estimate the model denoised distribution. The second
is single-shot estimation, which converts the model output into the terminal
denoised distribution via DDIM modeling. Analytically and empirically, we
reveal that DDE, equipped with the two estimation strategies, naturally derives a
novel credit assignment scheme that prioritizes optimizing the middle part of
the denoising trajectory. Extensive experiments demonstrate that our approach
achieves superior performance, both quantitatively and qualitatively.
♻ ★ MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models AAAI-25
Medical Large Language Models (MLLMs) have demonstrated potential in
healthcare applications, yet their propensity for hallucinations -- generating
medically implausible or inaccurate information -- presents substantial risks
to patient care. This paper introduces MedHallBench, a comprehensive benchmark
framework for evaluating and mitigating hallucinations in MLLMs. Our
methodology integrates expert-validated medical case scenarios with established
medical databases to create a robust evaluation dataset. The framework employs
a sophisticated measurement system that combines automated ACHMI (Automatic
Caption Hallucination Measurement in Medical Imaging) scoring with rigorous
clinical expert evaluations and utilizes reinforcement learning methods to
achieve automatic annotation. Through an optimized reinforcement learning from
human feedback (RLHF) training pipeline specifically designed for medical
applications, MedHallBench enables thorough evaluation of MLLMs across diverse
clinical contexts while maintaining stringent accuracy standards. We conducted
comparative experiments involving various models, utilizing the benchmark to
establish a baseline for widely adopted large language models (LLMs). Our
findings indicate that ACHMI provides a more nuanced understanding of the
effects of hallucinations compared to traditional metrics, thereby highlighting
its advantages in hallucination assessment. This research establishes a
foundational framework for enhancing MLLMs' reliability in healthcare settings
and presents actionable strategies for addressing the critical challenge of AI
hallucinations in medical applications.
comment: Published to AAAI-25 Bridge Program